BOX PCT 

IN THE UNITED STATES DESIGNATED/ELECTED OFFICE 
OF THE UNITED STATES PATENT AND TRADEMARK OFFICE 
UNDER THE PATENT COOPERATION TREATY - CHAPTER II 

AMENDMENT "A" PRIOR TO ACTION AND 
SUBMISSION OF SUBSTITUTE SPECIFICATION 

APPLICANT(S): NEUNEIER, R., et al. 

ATTORNEY DOCKET NO: P01,0020 

INTERNATIONAL APPLICATION NO: PCT/DE99/02846 

INTERNATIONAL FILING DATE: 8 SEP 1999 

INVENTION: METHOD AND ARRANGEMENT FOR 

DETERMINING A SEQUENCE OF 
ACTIONS FOR A SYSTEM 

Assistant Commissioner for Patents 
Washington, DC 20231 

Sir: 

Applicants herewith submit an amendment and substitute specification in 
the captioned PCT application, and respectfully request entry of same prior to 
examination in the United States National Stage. 

IN THE SPECIFICATION 

Cancel the specification as filed and insert therefor the substitute 
specification provided herewith. 

IN THE CLAIMS 

Cancel claims 1 - 21 as filed and insert therefor new claims 22 - 42 as 
follows: 
- - What is claimed is: 

22. (New) A method for computer-aided determination of a sequence of 
actions for a system having states, the method comprising the steps of: 

performing a transition in state between two states on the basis of an action; 

determining the sequence of actions to be performed such that a 
sequence of states results from the sequence of actions; 

optimizing the sequence of states with regard to a prescribed optimization 
function, including a variable parameter; and 

using the variable parameter to set a risk which the resulting sequence of 
states has with respect to a prescribed state of the system. 

23. (New) The method as claimed in claim 22, further comprising the 
step of: 

using approximative dynamic programming for the purpose of 
determination. 

24. (New) The method as claimed in claim 23, further comprising the 
step of: 

basing the approximative dynamic programming on Q-learning. 
25. (New) The method as claimed in claim 24, further comprising 
the steps of: 

forming an optimization function within Q-learning in accordance with the 
following rule: 

$$ \mathrm{OFQ} = Q(x; w^{a}); $$

and 

adapting weights of the function approximator in accordance with the 
following rule: 

$$ w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}), $$

wherein 

$$ d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}). $$



26. (New) The method as claimed in claim 23, further comprising the 
step of: 

basing the approximative dynamic programming on TD(λ)-learning. 
27. (New) The method as claimed in claim 26, further comprising the 
steps of: 

forming the optimization function within TD(λ)-learning in accordance with 
the following rule: 

$$ \mathrm{OFTD} = J(x; w); $$

and 

adapting weights of the function approximator in accordance 
with the following rule: 

$$ w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t, $$

wherein 

$$ d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t), $$

$$ z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t), \quad \text{and} \quad z_{-1} = 0. $$

28. (New) The method as claimed in claim 27, further comprising the 
step of: 

using a technical system, on which measured values are measured before the 
determination, to determine the sequence of actions. 

29. (New) The method as claimed in claim 28, further comprising the 
step of: 

subjecting the technical system to open-loop control in accordance with 
the sequence of actions. 
30. (New) The method as claimed in claim 28, further comprising the 
step of: 

subjecting the technical system to closed-loop control in accordance with 
the sequence of actions. 

31. (New) The method as claimed in claim 30, further comprising the 
step of: 

modeling the system as a Markov Decision Problem. 

32. (New) The method as claimed in claim 31, further comprising the 
step of: 

using the system in a traffic management system. 

33. (New) The method as claimed in claim 31, further comprising the 
step of: 

using the system in a communications system. 

34. (New) The method as claimed in claim 31, further comprising the 
step of: 

using the system to carry out access control in a communications network. 
35. (New) The method as claimed in claim 31, further comprising the 
step of: 

using the system to carry out routing in a communications network. 

36. (New) A system for determining a sequence of actions for a system 
having states, wherein a transition in state between two states is performed on 
the basis of an action, the system comprising: 

a processor for determining a sequence of actions, whereby a sequence 
of states resulting from the sequence of actions is optimized with regard to a 
prescribed optimization function, and the optimization function includes a variable 
parameter for setting a risk which the resulting sequence of states has with 
respect to a prescribed state of the system. 

37. (New) The system as claimed in claim 36, wherein the processor is 
used to subject a technical system to open-loop control. 

38. (New) The system as claimed in claim 36, wherein the processor is 
used to subject a technical system to closed-loop control. 

39. (New) The system as claimed in claim 36, wherein the processor is used 
in a traffic management system. 

40. (New) The system as claimed in claim 36, wherein the processor is used 
in a communications system. 

41. (New) The system as claimed in claim 36, wherein the processor is used 
to carry out access control in a communications network. 

42. (New) The system as claimed in claim 36, wherein the processor is used 
to carry out routing in a communications network. - - 

IN THE ABSTRACT 

Cancel the Abstract as filed, and insert therefor, on a separate page, the 
following Abstract of the disclosure: 

- - ABSTRACT OF THE DISCLOSURE 

A determination of a sequence of actions is performed such that a 
sequence of states resulting from the sequence of actions is optimized using a 
prescribed optimization function. The optimization function includes a variable 
parameter with which it is possible to establish a risk relating to the resulting 
sequence of states based upon a prescribed state of the system. - - 
REMARKS 



A substitute specification and a proper abstract of the disclosure are 
provided herewith which make editorial changes in order to conform to standard 
US practice. A marked-up copy of the specification is also provided reflecting the 
changes made. 

In addition, the claims as filed have been canceled and replaced by new 
claims that more clearly set forth the subject matter of Applicants' invention. 

No new matter has been inserted into the application. 

Applicants submit that this application is in proper condition for 
examination in the United States National Stage, which action is earnestly 
solicited. 



Respectfully submitted, 

Steven H. Noll (Reg. No. 28,982) 

SCHIFF HARDIN & WAITE 
Patent Department 
6600 Sears Tower 
233 South Wacker Drive 
Chicago, IL 60606 
Telephone: (312) 258-5790 
Attorneys for Applicant(s) 

Customer Number: 26574 

INTERNATIONAL APPLICATION NO: 

Assistant Commissioner for Patents 
Washington, DC 20231 

Sir: 

Applicants herewith submit four drawing sheets, showing Figures 1 - 6, in 
the captioned PCT application. 



Respectfully submitted, 

Steven H. Noll (Reg. No. 28,982) 
Substitute specification: 

- - METHOD AND ARRANGEMENT FOR DETERMINING 
A SEQUENCE OF ACTIONS FOR A SYSTEM 

BACKGROUND OF THE INVENTION 

Field of the Invention: 

This invention generally pertains to systems having states, and in particular to 
methods for determining a sequence of actions for such systems. 

Discussion of the Related Art: 

A generalized method and arrangement for determining a sequence of 
actions for a system having states, wherein a transition in state between two states 
is performed on the basis of an action, is discussed by Neuneier in "Enhancing 
Q-Learning for Optimal Asset Allocation", appearing in the Proceedings of the Neural 
Information Processing Systems, NIPS 1997. Neuneier describes a financial market 
as an example of a system which has states. His system is described as a Markov 
Decision Problem (MDP). 

The characteristics of a Markov Decision Problem are represented below by 
way of summary: 

$X$                          set of possible states of the system, e.g. $X = \mathbb{R}^m$, 

$A(x_t)$                     set of possible actions in the state $x_t$, 

$p(x_{t+1} \mid x_t, a_t)$   transition probability, 

$r(x_t, a_t, x_{t+1})$       gain with expectation $R(x_t, a_t)$. 



Starting from observable variables, denoted below as training 
data, the aim is to determine a strategy, that is to say a sequence of functions 

$$ \pi = \{\mu_0, \mu_1, \mu_2, \ldots\} \tag{3} $$

which at each instant t map each state into an action rule, that is to say 

$$ \mu_t(x_t) = a_t. \tag{4} $$

Such a strategy is evaluated by an optimization function. 

The optimization function specifies the expectation of the gains accumulated 
over time for a given strategy $\pi$ and a start state $x_0$. 

The so-called Q-learning method is described by Neuneier as an example of 
a method of approximative dynamic programming. 

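An optimum evaluation function $V^*(x)$ is defined by 

$$ V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X \tag{5} $$

where 

$$ V^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^t \, r(x_t, \mu_t(x_t), x_{t+1}) \,\Big|\, x_0 = x \right], \tag{6} $$

$\gamma$ denoting a prescribable reduction factor which is formed in accordance with the 
following rule: 

$$ z \in \mathbb{R}^{+}. \tag{8} $$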



A Q-evaluation function $Q^*(x_t, a_t)$ is formed within the Q-learning method for 
each pair (state $x_t$, action $a_t$) in accordance with the following rule: 

$$ Q^*(x_t, a_t) := \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \cdot r_t + \gamma \cdot \sum_{x \in X} p(x \mid x_t, a_t) \cdot \max_{a \in A} Q^*(x, a). \tag{9} $$

On the basis of each tuple $(x_t, x_{t+1}, a_t, r_t)$, the Q-values $Q^*(x, a)$ are 
adapted in the (k+1)th iteration, with a prescribed learning rate $\eta_k$, in accordance 
with the following learning rule: 

$$ Q_{k+1}(x_t, a_t) = (1 - \eta_k) \, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right). \tag{10} $$
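Learning rule (10) can be illustrated for the tabular case with a minimal sketch. The state names, the action names and the helper `q_learning_update` are hypothetical and serve only as an illustration; they are not taken from Neuneier's publication.

```python
from collections import defaultdict


def q_learning_update(q, x_t, a_t, r_t, x_next, actions, eta_k, gamma):
    """Apply learning rule (10) once for the tuple (x_t, x_next, a_t, r_t)."""
    # Best Q-value reachable from the subsequent state over all actions.
    best_next = max(q[(x_next, a)] for a in actions)
    # Blend the old estimate with the new target, weighted by the learning rate.
    q[(x_t, a_t)] = (1.0 - eta_k) * q[(x_t, a_t)] + eta_k * (r_t + gamma * best_next)
    return q[(x_t, a_t)]


q = defaultdict(float)          # all Q-values start at zero (assumption)
actions = ["hold", "switch"]    # hypothetical action set
first = q_learning_update(q, "s0", "hold", 1.0, "s1", actions, eta_k=0.5, gamma=0.9)
print(first)
```

With all Q-values initialized to zero, the first update moves the estimate halfway toward the observed gain, because the learning rate is 0.5 and the subsequent state contributes nothing yet.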

Usually, the so-called Q-values $Q^*(x, a)$ are approximated for various actions by 
a function approximator in each case, for example a neural network or a polynomial 
classifier, with a weighting vector $w^{a}$, which contains weights of the function 
approximator. 

A function approximator is, for example, a neural network, a polynomial 
classifier or a combination of a neural network with a polynomial classifier. 

It therefore holds that: 

$$ Q^*(x, a) \approx Q(x; w^{a}). \tag{11} $$

Changes in the weights in the weighting vector $w^{a}$ are based on a temporal 
difference $d_t$ which is formed in accordance with the following rule: 

$$ d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t}). \tag{12} $$

The following adaptation rule for the weights of the neural network, which are 
included in the weighting vector $w^{a}$, follows for the Q-learning method with the use of 
a neural network: 

$$ w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}). \tag{13} $$

The neural network representing the system of a financial market as described 
by Neuneier is trained using the training data which describe information on changes 
in prices on a financial market as time series values. 

A further method of approximative dynamic programming is the so-called TD(λ) 
learning method. This method is discussed in R.S. Sutton's "Learning To Predict By 
The Method Of Temporal Differences", appearing in Machine Learning, Volume 3, 
pages 9-44, 1988. 

Furthermore, it is known from M. Heger's "Risk and Reinforcement Learning: 
Concepts and Dynamic Programming", ZKW Bericht No. 8/94, Zentrum für 
Kognitionswissenschaften [Center for Cognitive Sciences], Bremen University, 
December 1994, that risk is associated with a strategy $\pi$ and an initial 
state $x_t$. A method for risk avoidance is also discussed by Heger, cited above. 

The following optimization function, which is also referred to as an expanded 
Q-function $Q^{\pi}(x_t, a_t)$, is used in the Heger method: 

(14) 

The expanded Q-function $Q^{\pi}(x_t, a_t)$ describes the worst case if the action $a_t$ is 
executed in the state $x_t$ and the strategy $\pi$ is followed thereupon. 

The optimization function $Q^*(x_t, a_t)$ to be maximized, 

$$ Q^*(x_t, a_t) := \max_{\pi} Q^{\pi}(x_t, a_t), \tag{15} $$

is given by the following rule: 

$$ \min_{x \in X} \ldots \tag{16} $$
A substantial disadvantage of this procedure is that only the worst 
case is taken into account when finding the strategy. This inadequately 
reflects the requirements of the wide variety of technical systems. 

In "Dynamic Programming and Optimal Control", Athena Scientific, Belmont, 
MA, 1995, D.P. Bertsekas formulates access control for a communications network 
and routing within the communications network as a problem of dynamic 
programming. 

Therefore, the present invention is based on the problem of specifying a 
method and system for determining a sequence of actions which achieve 
increased flexibility in determining the strategy needed. 

In a method for computer-aided determination of a sequence of actions for a 
system which has states, a transition in state between two states being performed 
on the basis of an action, the determination of the sequence of actions is performed 
in such a way that a sequence of states resulting from the sequence of actions is 
optimized with regard to a prescribed optimization function, the optimization function 
including a variable parameter with the aid of which it is possible to set a risk which 
the resulting sequence of states has with respect to a prescribed state of the system. 

A system for determining a sequence of actions for a system which has 
states, a transition in state between two states being performed on the basis of an 
action, has a processor which is set up in such a way that the determination of the 
sequence of actions can be performed in such a way that a sequence of states 
resulting from the sequence of actions is optimized with regard to a prescribed 
optimization function, the optimization function including a variable parameter with 
the aid of which it is possible to set a risk which the resulting sequence of states has 
with respect to a prescribed state of the system. 

Thus, the present invention offers a method for determining a sequence of 
actions at a freely prescribable level of accuracy when finding a strategy for a 
possible closed-loop control or open-loop control of the system, in general for 
influencing it. Hence, the embodiments described below are valid both for the 
method and for the system. 

Approximative dynamic programming is used for the purpose of determination, 
for example a method based on Q-learning or a method based on TD(λ)-learning. 

Within Q-learning, the optimization function OFQ is preferably formed in 
accordance with the following rule: 

$$ \mathrm{OFQ} = Q(x; w^{a}), $$

• $x$ denoting a state in a state space $X$, 

• $a$ denoting an action from an action space $A$, and 

• $w^{a}$ denoting the weights of a function approximator which belong to the action $a$. 

The following adaptation step is executed during Q-learning in order to 
determine the optimum weights $w^{a}$ of the function approximator: 

$$ w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}) $$

with the abbreviation 

$$ d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}), $$

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$, 

• $a_t$ denoting an action from an action space $A$, 

• $\gamma$ denoting a prescribable reduction factor, 

• $w_t^{a_t}$ denoting the weighting vector associated with the action $a_t$ before the 
adaptation step, 

• $w_{t+1}^{a_t}$ denoting the weighting vector associated with the action $a_t$ after the 
adaptation step, 

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence, 

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter, 

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$, 

• $\nabla Q(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its 
weights, and 

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the 
subsequent state $x_{t+1}$. 
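The Q-learning adaptation step can be sketched as follows. The sketch assumes, purely for illustration, a linear function approximator Q(x; w^a) = w^a · x, so that the derivative with respect to the weights is simply the state vector x; the action names, the state vectors and the helper names are hypothetical.

```python
import numpy as np


def risk_function(xi, kappa):
    """Risk monitoring function: (1 - kappa * sign(xi)) * xi."""
    return (1.0 - kappa * np.sign(xi)) * xi


def q_step(weights, x_t, a_t, r_t, x_next, eta_t, gamma, kappa):
    """One adaptation step for the weights of action a_t (linear Q, an assumption)."""
    # Greedy value of the subsequent state over all actions.
    q_next = max(float(w @ x_next) for w in weights.values())
    # Temporal difference d_t.
    d_t = r_t + gamma * q_next - float(weights[a_t] @ x_t)
    # For a linear approximator the gradient of Q with respect to w^a is x_t.
    weights[a_t] = weights[a_t] + eta_t * risk_function(d_t, kappa) * x_t
    return d_t


weights = {"accept": np.zeros(2), "reject": np.zeros(2)}   # hypothetical actions
x_t = np.array([1.0, 0.0])
x_next = np.array([0.0, 1.0])
d_t = q_step(weights, x_t, "accept", 1.0, x_next, eta_t=0.1, gamma=0.9, kappa=0.5)
print(d_t, weights["accept"])
```

With a risk monitoring parameter of 0.5, the positive temporal difference of 1.0 is halved before it reaches the weights, which is the risk-avoiding behavior the parameter is meant to produce.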

The optimization function is preferably formed in accordance with the following 
rule within the TD(λ)-learning method: 

$$ \mathrm{OFTD} = J(x; w), $$

• $x$ denoting a state in a state space $X$, 

• $a$ denoting an action from an action space $A$, and 

• $w$ denoting the weights of a function approximator. 

The following adaptation step is executed during TD(λ)-learning in order to 
determine the optimum weights $w$ of the function approximator: 

$$ w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t $$

with the abbreviations 

$$ d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t), $$

$$ z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t), $$

$$ z_{-1} = 0, $$

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$, 

• $a_t$ denoting an action from an action space $A$, 

• $\gamma$ denoting a prescribable reduction factor, 

• $w_t$ denoting the weighting vector before the adaptation step, 

• $w_{t+1}$ denoting the weighting vector after the adaptation step, 

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence, 

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter, 

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$, 

• $\nabla J(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its 
weights, and 

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the 
subsequent state $x_{t+1}$. 
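The TD(λ) adaptation step, including the trace z that starts at zero, can be sketched over a short sequence of transitions. A linear approximator J(x; w) = w · x is assumed here for illustration only, so that the derivative with respect to the weights equals x; the transitions and the helper name `td_lambda_step` are hypothetical.

```python
import numpy as np


def td_lambda_step(w, z_prev, x_t, r_t, x_next, eta_t, gamma, lam, kappa):
    """One TD(lambda) adaptation step with a linear J(x; w) = w . x (assumption)."""
    d_t = r_t + gamma * float(w @ x_next) - float(w @ x_t)   # temporal difference
    z_t = lam * gamma * z_prev + x_t                         # trace; grad J = x_t
    shaped = (1.0 - kappa * np.sign(d_t)) * d_t              # risk monitoring function
    return w + eta_t * shaped * z_t, z_t, d_t


w = np.zeros(2)
z = np.zeros(2)   # corresponds to the initial trace being zero
transitions = [   # hypothetical (x_t, gain, x_next) samples
    (np.array([1.0, 0.0]), -1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 2.0, np.array([1.0, 0.0])),
]
for x_t, r_t, x_next in transitions:
    w, z, d = td_lambda_step(w, z, x_t, r_t, x_next,
                             eta_t=0.1, gamma=0.9, lam=0.5, kappa=0.5)
print(w, z)
```

In this run the negative temporal difference of the first transition is amplified by a factor of 1.5 and the positive one of the second transition is damped by a factor of 0.5, so losses shape the weights more strongly than gains.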

SUMMARY OF THE INVENTION 

It is an object of the present invention to provide a technical system and 
method for determining a sequence of actions using measured values. 

It is another object of the present invention to provide a technical system and 
method that can be subjected to open-loop control or closed-loop control with the 
use of a determined sequence of actions. 

It is a further object of the invention to provide a technical system and method 
modeled as a Markov Decision Problem. 

It is an additional object of the invention to provide a technical system and 
method that can be used in a traffic management system. 

It is yet another object of the invention to provide a technical system and 
method that can be used in a communications system, such that a sequence of 
actions is used to carry out access control, routing or path allocation. 

It is yet a further object of the invention to provide a technical system and 
method for a financial market modeled by a Markov Decision Problem, wherein a 
change in an index of stocks, or a change in a rate of exchange on a foreign 
exchange market, makes it possible to intervene in the market in accordance with a 
sequence of determined actions. 

These and other objects of the invention will be apparent from a careful review 
of the following detailed description of the preferred embodiments, which is to be read in 
conjunction with a review of the accompanying drawing figures. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 shows a flowchart of method steps according to the present invention; 

Figure 2 shows a system modeled as a Markov Decision Problem; 

Figure 3 shows a communications network wherein access control is carried out 
in a switching unit according to the present invention; 

Figure 4 shows a function approximator for approximative dynamic 
programming according to the present invention; 

Figure 5 shows a plurality of function approximators for approximative dynamic 
programming according to the present invention; and 

Figure 6 shows a traffic management system subjected to closed-loop control in 
accordance with the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Figure 1 shows a flowchart according to the present invention, in which 
individual method steps of a first embodiment are provided, which will be discussed 
later. 

Figure 2 shows the structure of a typical Markov Decision Problem method. 

The system 201 is in a state $x_t$ at an instant t. The state $x_t$ can be observed by 
an observer of the system. On the basis of an action $a_t$ from the set of actions 
possible in the state $x_t$, $a_t \in A(x_t)$, the system makes a transition with a certain 
probability into a subsequent state $x_{t+1}$ at a subsequent instant t+1. 

As illustrated diagrammatically in Figure 2 by a loop, an observer 200 
perceives 202 observable variables concerning the state $x_t$ and takes a decision via 
an action 203 with which it acts on the system 201. The system 201 is usually 
subject to interference 205. 

The observer 200 obtains a gain $r_t$ 204 

$$ r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}, \tag{1} $$

which is a function of the action $a_t$ 203 and the original state $x_t$ at the instant t as well 
as of the subsequent state $x_{t+1}$ of the system at the subsequent instant t+1. 

The gain $r_t$ can assume a positive or negative scalar value depending on 
whether the decision leads, with regard to a prescribable criterion, to a positive or 
negative system development, for example to an increase in capital stock or to a loss. 



In a further time step, the observer 200 of the system 201 decides on the 
basis of the observable variables 202, 204 of the subsequent state $x_{t+1}$ in favor of a 
new action $a_{t+1}$, etc. 

A sequence of 

State: $x_t \in X$ 

Action: $a_t \in A(x_t)$ 

Subsequent state: $x_{t+1} \in X$ 

Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$ 

describes a trajectory of the system which is evaluated by a performance criterion 
which accumulates the individual gains $r_t$ over the instants t. It is assumed by way of 
simplification in a Markov Decision Problem that the state $x_t$ and the action $a_t$ 
contain all the information needed to describe a transition probability $p(x_{t+1} \mid \cdot)$ of 
the system from the state $x_t$ to the subsequent state $x_{t+1}$. 

In formal terms, this means that: 

$$ p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t). \tag{2} $$

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability for the subsequent state $x_{t+1}$ for a given 
state $x_t$ and given action $a_t$. 



In a Markov Decision Problem, future states of the system 201 are thus not a 
function of states and actions which lie further in the past than one time step. 



Figure 3 shows an embodiment of the present invention involving an access 
control and routing system, such as a communications network 300. 

The communications network 300 has a multiplicity of switching units 301a, 
301b, 301i, ... 301n, which are interconnected via connections 302a, 302b, 302j, 
... 302m. A first terminal 303 is connected to a first switching unit 301a. From the first 
terminal 303, the first switching unit 301a is sent a request message 304 which 
requests reservation of a prescribed bandwidth within the communications network 
300 for the purpose of transmitting data, such as video data or text data. 

It is determined in the first switching unit 301a in step 305, in accordance with a 
strategy described below, whether the requested bandwidth is available in the 
communications network 300 on a specified, requested connection. The request is 
refused in step 306 if this is not the case. If sufficient bandwidth is available, it is 
checked in checking step 307 whether the bandwidth can be reserved. 

The request is refused in step 308 if this is not the case. Otherwise, the first 
switching unit 301a selects a route from the first switching unit 301a via further 
switching units 301i to a second terminal 309 with which the first terminal 303 wishes 
to communicate, and a connection is initialized in step 310. 

The starting point below is a communications network 300 which comprises a 
set of switching units 

$$ N = \{1, \ldots, n, \ldots, N\} \tag{17} $$

and a set of physical connections 

$$ L = \{1, \ldots, l, \ldots, L\}, \tag{18} $$

a physical connection $l$ having a capacity of $B(l)$ bandwidth units. 

A set 

$$ M = \{1, \ldots, m, \ldots, M\} \tag{19} $$

of different types of service $m$ is available, a type of service $m$ being characterized 
by 

• a bandwidth requirement $b(m)$, 

• an average connection time $1/\nu(m)$, and 

• a gain $c(m)$ which is obtained whenever a call request of the corresponding 
type of service $m$ is accepted. 

The gain $c(m)$ is given by the amount of money which a network operator of 
the communications network 300 bills a subscriber for a connection of the type of 
service. Illustratively, the gain $c(m)$ reflects different priorities, which can be prescribed by 
the network operator and which the operator associates with different services. 

A physical connection $l$ can simultaneously provide any desired combination 
of communications connections as long as the bandwidth used for the 
communications connections does not exceed the bandwidth available overall for the 
physical connection. 
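The capacity condition for a single physical connection can be sketched as a simple check. The service names, the bandwidth requirements and the helper `link_can_carry` are hypothetical values chosen for illustration only.

```python
def link_can_carry(connections_per_type, bandwidth_per_type, capacity):
    """Return True if the combined bandwidth of all connections fits the capacity B(l)."""
    used = sum(count * bandwidth_per_type[m]
               for m, count in connections_per_type.items())
    return used <= capacity


# Hypothetical bandwidth requirements b(m) in bandwidth units.
b = {"voice": 1, "video": 4}

fits = link_can_carry({"voice": 3, "video": 2}, b, capacity=12)      # uses 11 units
overflow = link_can_carry({"voice": 3, "video": 3}, b, capacity=12)  # uses 15 units
print(fits, overflow)
```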

If a new communications connection of type m is requested between a first 
node i and a second node j (terminals are also denoted as nodes), the requested 
communications connection can, as represented above, either be accepted or be 
refused. If the communications connection is accepted, a route is selected from a set 
of prescribed routes. This selection is denoted as routing. $b(m)$ bandwidth units are 
used in the communications connection of type m for each physical connection along 
the selected route for the duration of the connection. 

Thus, during access control, also referred to as call admission control, a route 
can be selected within the communications network 300 only when the selected 
route has sufficient bandwidth available. The aim of the access control and of the 
routing is to maximize a long term gain which is obtained by acceptance of the 
requested connections. 

At an instant t, the technical system which is the communications network 300 
is in a state $x_t$ which is described by a list of routes via existing connections, the 
list showing how many connections of which type of service are 
using the respective routes at the instant t. 

Events $\omega$, by means of which a state $x_t$ could be transferred into a subsequent 
state $x_{t+1}$, are the arrival of new connection request messages, or else the 
termination of a connection existing in the communications network 300. 

In this embodiment, an action $a_t$ at an instant t, owing to a connection request, 
is the decision as to whether a connection request is to be accepted or refused and, 
if the connection is accepted, the selection of the route through the communications 
network 300. 

The aim is to determine a sequence of actions, that is to say, illustratively, to 
learn a strategy with actions relating to a state $x_t$ in such a way 
that the following rule is maximized: 

$$ E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{20} $$

• $E\{\cdot\}$ denoting an expectation, 

• $t_k$ denoting an instant at which a kth event takes place, 

• $g(x_{t_k}, \omega_k, a_{t_k})$ denoting the gain which is associated with the kth event, and 

• $\beta$ denoting a reduction factor which evaluates an immediate gain as being 
more valuable than a gain at instants lying further in the future. 

Different implementations of a strategy normally lead to different overall gains G: 

$$ G = \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}). \tag{21} $$

The aim is to maximize the expectation of the overall gain G in accordance 
with the following rule J: 

$$ J = E\{G\} = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{22} $$

it being possible to set a risk that the overall gain G of a specific 
implementation of access control and of a routing strategy falls below the expectation. 

The TD(λ)-learning method is used to carry out the access control and the 
routing. 
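The overall gain of one implementation of a strategy can be sketched as follows. The sketch assumes that the reduction factor enters as an exponential discount exp(-beta * t_k) applied to the gain of the kth event; the name `beta`, the event list and the helper `overall_gain` are hypothetical.

```python
import math


def overall_gain(events, beta):
    """Sum the event gains, each discounted by exp(-beta * t_k) (assumed form)."""
    return sum(math.exp(-beta * t_k) * g_k for t_k, g_k in events)


# Hypothetical implementation of a strategy: (event instant t_k, gain g_k).
events = [(0.0, 2.0), (1.0, 0.0), (2.0, 5.0)]
G = overall_gain(events, beta=0.1)
print(G)
```

Averaging `overall_gain` over many recorded implementations would give an estimate of the expectation J; individual implementations can fall well below that average, which is the risk the variable parameter is intended to control.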



The following target function is used in this embodiment: 

(23) 

• $A$ denoting an action space with a prescribed number of actions which are 
respectively available in a state $x_t$, 

• t denoting a first instant at which a first event $\omega$ occurs, and 

• $x_{t+1}$ denoting a subsequent state of the system. 

An approximated value of the target value $J^*(x_t)$ is learned and stored by 
employing a function approximator 400 (compare Figure 4) with the use of training 
data. 

Training data are data previously measured in the communications network 
300 and relating to the behavior of the communications network 300 in the case of 
incoming connection requests 304 and of termination of messages. This time 
sequence of states is stored, and these training data are used to train the function 
approximator 400 in accordance with the learning method described below. 

The number of connections of one type of service m on a route of 
the communications network 300 serves in each case as the input variable of the function 
approximator 400 for each input 401, 402, 403 of the function approximator 400. 
These are represented in Figure 4 by blocks 404, 405, 406. An approximated target 
value $\tilde{J}$ of the target value $J^*$ is the output variable of the function approximator 400. 
Figure 5 shows a detailed representation of a function approximator 500, which 
has several component function approximators 510, 520. 

One output variable is the approximated target value $\tilde{J}$, which is formed in 
accordance with the following rule: 

(24) 

The input variables of the component function approximators 510, 520, which 
are present at the inputs 511, 512, 513 of the first component function approximator 
510, or at the inputs 521, 522 and 523 of the second component function 
approximator 520 are, in turn, respectively a number of connections of a type of service m 
on a physical connection r in each case, symbolized by blocks 514, 515, 516 for the 
first component function approximator 510, and 524, 525 and 526 for the second 
component function approximator 520. 

Component output variables 530, 531, 532, 533 are fed to an adder unit 540, 
and the approximated target variable $\tilde{J}$ is formed as output variable of the adder 
unit. 

Let it be assumed that the communications network 300 is in the state $x_{t_k}$ and 
that a request message with which a type of service of class m is requested for a 
connection between two nodes i, j reaches the first switching unit 301a. 
A list of permitted routes between the nodes i and j is denoted by $R(i, j)$, and a 
list of all possible routes is denoted by 

$$ R(i, j, x_{t_k}) \subset R(i, j) \tag{25} $$

as a subset of the routes $R(i, j)$ which could implement a possible connection with 
regard to the available and requested bandwidth. 

For each possible route r, $r \in R(i, j, x_{t_k})$, a subsequent state 
$x_{t_{k+1}}(x_{t_k}, \omega_k, r)$ is determined which results from the fact that the connection 
request 304 is accepted and the connection on the route r is made available to the 
requesting first terminal 303. 

This is illustrated in Figure 1 as step 102, the state of the system and the 
respective event being respectively determined in step 101 . A route r* to be selected 
is determined in step 103 in accordance with the following rule: 

r* = arg max , j( x t k + l( x t k / <°k' r ) ©t) • (26) 
reR(i,j,x t] J 

A check is made in step 104 as to whether the following rule is fulfilled: 

c(m) + j(x tk+1 (x tk , co k , r*), © t ) < j(x tk ,0 t ). (27) 

If this is the case, the connection request 304 is rejected in step 105, otherwise 
the connection is accepted and "switched through" to the node j along the selected 
route r* in step 106. 
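
Steps 103 to 106 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `admit_and_route`, `next_state` and `value` are hypothetical, `value` stands in for the approximated target variable of rules (26) and (27), and states and routes are treated as opaque objects.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple

State = Hashable
Route = Hashable


def admit_and_route(
    state: State,
    feasible_routes: Iterable[Route],
    next_state: Callable[[State, Route], State],
    value: Callable[[State], float],
    gain: float,
) -> Tuple[bool, Optional[Route]]:
    """Apply rules (26) and (27) to one connection request.

    value(x) is an assumed stand-in for the approximated target variable,
    gain is c(m). Returns (accepted, selected route).
    """
    routes = list(feasible_routes)
    if not routes:
        # No route can carry the requested bandwidth, so nothing is accepted.
        return False, None

    # Rule (26): pick the route whose subsequent state has the highest value.
    best = max(routes, key=lambda r: value(next_state(state, r)))

    # Rule (27): reject if accepting is worth less than the current state.
    if gain + value(next_state(state, best)) < value(state):
        return False, None
    return True, best
```

With a toy state of free capacities per link, a call such as `admit_and_route((3, 2), [0, 1], next_state, value, gain=2.0)` returns the accept decision together with the selected route.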
Weights of the function approximator 400, 500, which are adapted in the 
TD(λ)-learning method to the training data, are stored in a parameter vector Θ for an 
instant t in each case, such that an optimized access control and an optimized 
routing are achieved. 



During the training phase, the weighting parameters are adapted to the 
training data applied to the function approximator. 

A risk parameter κ is defined with the aid of which a desired risk, which the 
system has with regard to a prescribed state owing to a sequence of actions and 
states, can be set in accordance with the following rules: 

-1 ≤ κ < 0: risky learning, 

κ = 0: neutral learning with regard to the risk, 

0 < κ < 1: risk-avoiding learning, 

κ = 1: worst-case learning. 

Furthermore, a prescribable parameter 0 < λ < 1 and a step size sequence $\gamma_k$ 
are prescribed in the learning method. 



The weighting values of the weighting vector Θ are adapted to the training 
data on the basis of each event $\omega_{t_k}$ in accordance with the following adaptation 
rule: 

    $\Theta_k = \Theta_{k-1} + \gamma_k \, \aleph^{\kappa}(d_k) \, z_t$,    (28) 
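in which case 

    $d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1})$,    (29) 

    $z_t = \lambda \, e^{-\beta (t_{k-1} - t_{k-2})} \, z_{t-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1})$,    (30) 

and 

    $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$.    (31) 

It is assumed that $z_{-1} = 0$. 

The function 

    $g(x_{t_k}, \omega_k, a_{t_k})$    (32) 

denotes the immediate gain in accordance with the following rule: 

    $g(x_{t_k}, \omega_k, a_{t_k}) = \begin{cases} c(m) & \text{when } \omega_{t_k} \text{ is a service request for a type of service } m \text{ and the connection is accepted,} \\ 0 & \text{otherwise.} \end{cases}$    (33) 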
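
One adaptation step according to rules (28) to (31) might look as follows. This is a minimal sketch, assuming a linear function approximator (the target variable is the inner product of the parameter vector with a feature vector, so its gradient is the feature vector itself); the function names and the feature vectors `phi_prev` and `phi_cur` are illustrative assumptions and not part of the specification.

```python
import math
from typing import List, Sequence, Tuple


def risk_transform(d: float, kappa: float) -> float:
    """Rule (31): (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def td_lambda_step(
    theta: Sequence[float],     # parameter vector before the step
    z: Sequence[float],         # eligibility vector before the step
    phi_prev: Sequence[float],  # assumed features of the previous state
    phi_cur: Sequence[float],   # assumed features of the current state
    gain: float,                # immediate gain g, rule (33)
    dt_cur: float,              # t_k - t_(k-1)
    dt_prev: float,             # t_(k-1) - t_(k-2)
    step: float,                # step size gamma_k
    lam: float,                 # lambda
    beta: float,                # discount rate beta
    kappa: float,               # risk parameter
) -> Tuple[List[float], List[float], float]:
    j_prev = sum(t * p for t, p in zip(theta, phi_prev))
    j_cur = sum(t * p for t, p in zip(theta, phi_cur))
    # Rule (29): discounted temporal difference.
    d = math.exp(-beta * dt_cur) * (gain + j_cur) - j_prev
    # Rule (30): decay the trace and add the gradient (here: phi_prev).
    decay = lam * math.exp(-beta * dt_prev)
    z_new = [decay * zi + pi for zi, pi in zip(z, phi_prev)]
    # Rule (28): move the parameters along the trace, scaled by rule (31).
    scaled = step * risk_transform(d, kappa)
    theta_new = [t + scaled * zi for t, zi in zip(theta, z_new)]
    return theta_new, z_new, d
```

Setting `kappa=0.0` gives the ordinary TD(λ) update, while `kappa=1.0` suppresses every update with a positive temporal difference.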

Thus, as described above, a sequence of actions is determined with regard to 
a connection request such that a connection request is either rejected or accepted 
on the basis of an action. The determination is performed taking account of an 
optimization function in which the risk can be set in a variable fashion by means of 
a risk control parameter κ ∈ [-1; 1]. 
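
The effect of the risk control parameter can be made explicit by writing the risk control function of rule (31) out by cases. The numerical values below are an illustrative worked example only, not values taken from the specification.

```latex
\aleph^{\kappa}(\xi) =
\begin{cases}
(1 - \kappa)\,\xi, & \xi > 0,\\
0, & \xi = 0,\\
(1 + \kappa)\,\xi, & \xi < 0.
\end{cases}
\qquad
\begin{aligned}
\kappa = 0 &: \ \aleph^{0}(\xi) = \xi \quad \text{(risk-neutral)},\\
\kappa = 1 &: \ \aleph^{1}(\xi) = 0 \ \text{for } \xi > 0, \quad 2\xi \ \text{for } \xi < 0 \quad \text{(worst case)},\\
\kappa = 0.5 &: \ \aleph^{0.5}(+2) = 1, \quad \aleph^{0.5}(-2) = -3.
\end{aligned}
```

For positive κ, negative temporal differences are thus weighted more heavily than positive ones, which is what makes the learning risk-avoiding; for negative κ the weighting is reversed.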
Figure 6 shows an embodiment of the present invention in relation to a traffic 
management system. 

Automobiles 601, 602, 603, 604, 605 and 606 are being driven on a road 600. 
Conductor loops 610, 611 integrated into the road 600 receive electric 
signals in a known way and feed the electric signals 615, 616 to a computer 620 via 
an input/output interface 621. In an analog-to-digital converter 622 connected to the 
input/output interface 621, the electric signals are digitized into a time series and 
stored in a memory 623, which is connected by a bus 624 to the analog-to-digital 
converter 622 and a processor 625. Via the input/output interface 621, a traffic 
management system 650 is fed control signals 651 from which it is possible to set a 
prescribed speed stipulation 652 in the traffic management system 650, or else 
further particulars of traffic regulations, which are displayed via the traffic 
management system 650 to drivers of the vehicles 601, 602, 603, 604, 605 and 606. 

The following local state variables are used in this case for the purpose of 
traffic modeling: 

• traffic flow speed v, 

• vehicle density p (p = number of vehicles per kilometer, veh/km), 

• traffic flow q (q = number of vehicles per hour, veh/h; q = v * p), and 

• speed restrictions 652 displayed by the traffic management system 650 at an 
instant in each case. 



The local state variables are measured as described above by using the 
conductor loops 610, 611. 

These variables (v(t), p(t), q(t)) therefore represent a state of the technical 
system of "traffic" at a specific instant t. 

In this embodiment, the system is therefore a traffic system which is 
controlled by using the traffic management system 650, and an extended Q-learning 
method is described as method of approximative dynamic programming. 

The state $x_t$ is described by a state vector 

    $x(t) = (v(t), p(t), q(t))$.    (34) 

The action $a_t$ denotes the speed restriction 652, which is displayed at the 
instant t by the traffic management system 650. The gain $r(x_t, a_t, x_{t+1})$ describes the 
quality of the traffic flow which was measured between the instants t and t+1 by the 
conductor loops 610 and 611. 

In this embodiment, $r(x_t, a_t, x_{t+1})$ denotes 

• the average speed of the vehicles in the time interval [t, t + 1], 

or 

• the number of vehicles which have passed the conductor loops 610 and 611 
in the time interval [t, t + 1], 

or 

• the variance of the vehicle speeds in the time interval [t, t + 1], 

or 

• a weighted sum of the above variables. 

A value of the optimization function OFQ is determined for each possible 
action $a_t$, that is to say for each speed restriction which can be displayed by the 
traffic management system 650, an estimated value of the optimization function OFQ 
being realized in each case as a neural network. 

This results in a set of evaluation variables for the various actions $a_t$ in the 
system state $x_t$. In a control phase, that action $a_t$ for which the maximum evaluation 
variable OFQ has been determined in the current system state $x_t$ is selected 
from the possible actions, that is to say from the set of the speed restrictions which 
can be displayed by the traffic management system 650. 

In accordance with this embodiment, the adaptation rule, known from the 
Q-learning method, for calculating the optimization function OFQ is extended by a risk 
control function $\aleph^{\kappa}(\cdot)$, which takes account of the risk. 

In turn, the risk control parameter κ is prescribed in accordance with the 
strategy from the first exemplary embodiment in the interval $-1 \le \kappa \le 1$, and 
represents the risk which a user wishes to run in the application with regard to the 
control strategy to be determined. 



The following evaluation function OFQ is used in accordance with this 
exemplary embodiment: 

    $OFQ = Q(x; w^a)$,    (35) 

• x = (v; p; q) denoting a state of the traffic system, 

• a denoting a speed restriction from the action space A of all speed restrictions 
which can be displayed by the traffic management system 650, and 

• $w^a$ denoting the weights of the neural network which belong to the speed 
restriction a. 



The following adaptation step is executed in Q-learning in order to determine 
the optimum weights $w^a$ of the neural network: 

    $w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$    (36) 

using the abbreviation: 

    $d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t})$,    (37) 

• $x_t$, $x_{t+1}$ denoting in each case a state of the traffic system in accordance with 
rule (34), 

• $a_t$ denoting an action, that is to say a speed restriction which can be displayed 
by the traffic management system 650, 

• γ denoting a prescribable reduction factor, 

• $w_t^{a_t}$ denoting the weighting vector belonging to the action $a_t$, before the 
adaptation step, 

• $w_{t+1}^{a_t}$ denoting the weighting vector belonging to the action $a_t$, after the 
adaptation step, 

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence, 

• κ ∈ [-1; 1] denoting a risk control parameter, 

• $\aleph^{\kappa}$ denoting a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$, 

• $\nabla Q(\cdot\,; \cdot)$ denoting the derivative of the neural network with respect to its 
weights, and 

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition in state from the state $x_t$ to the 
subsequent state $x_{t+1}$. 

An action $a_t$ can be selected at random from the possible actions during 
learning. It is not necessary in this case to select the action $a_t$ which has led to the 
largest evaluation variable. 
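
The adaptation step of rules (36) and (37), together with random action selection during learning, can be sketched as follows. This is only a sketch under stated assumptions: each action is given a linear approximator in place of the neural network (so the derivative with respect to the weights is simply the state vector), and the function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def q_value(w: Sequence[float], x: Sequence[float]) -> float:
    """Assumed linear stand-in for the network output Q(x; w)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def risk_sensitive_q_step(
    weights: Dict[int, List[float]],  # one weighting vector per speed limit
    x: Sequence[float],               # state (v, p, q) at instant t
    action: int,                      # speed restriction shown at instant t
    reward: float,                    # gain r(x_t, a_t, x_t+1)
    x_next: Sequence[float],          # state at instant t+1
    step: float,                      # step size eta_t
    gamma: float,                     # reduction factor
    kappa: float,                     # risk control parameter in [-1, 1]
) -> float:
    # Rule (37): temporal difference with the maximum over all actions.
    best_next = max(q_value(w, x_next) for w in weights.values())
    d = reward + gamma * best_next - q_value(weights[action], x)
    # Risk control function: (1 - kappa * sign(d)) * d.
    sign = (d > 0) - (d < 0)
    scaled = (1.0 - kappa * sign) * d
    # Rule (36): only the weights of the executed action are adapted.
    w = weights[action]
    for i, xi in enumerate(x):
        w[i] += step * scaled * xi
    return d


def explore(actions: Sequence[int], rng: random.Random) -> int:
    """Random action selection during learning."""
    return rng.choice(list(actions))
```

Passing a seeded `random.Random` instance to `explore` keeps training runs reproducible.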

The adaptation of the weights has to be performed in such a way that the 
traffic control achieved is not only optimized in terms of the expectation of the 
optimization function, but also takes account of the variance of the control 
results. 

This is particularly advantageous since the state vector x(t) models the actual 
system of traffic only inadequately in some aspects, and so unexpected disturbances 
can occur. Thus, the dynamics of the traffic, and therefore of its modeling, 
depend on further factors such as weather, proportion of trucks on the road, 
proportion of mobile homes, etc., which are not always integrated in the measured 
variables of the state vector x(t). In addition, it is not always ensured that the road 
users immediately implement the new speed instructions in accordance with the 
traffic management system. 

A control phase on the real system in accordance with the traffic management 
system takes place in accordance with the following steps: 

1. The state $x_t$ is measured at the instant t at various points in the traffic system 
and yields a state vector x(t) := (v(t), p(t), q(t)). 

2. A value of the optimization function is determined for all possible actions $a_t$, 
and that action $a_t$ with the highest evaluation in the optimization function is 
selected. 
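
The two steps of the control phase can be sketched as a simple selection of the best-rated action. The sketch assumes one weighting vector per displayable speed restriction and a linear evaluation in place of the trained neural network; the names `q_value` and `select_speed_limit` are hypothetical.

```python
from typing import Dict, Sequence


def q_value(w: Sequence[float], x: Sequence[float]) -> float:
    """Assumed linear stand-in for the trained network output."""
    return sum(wi * xi for wi, xi in zip(w, x))


def select_speed_limit(
    state: Sequence[float],               # measured state vector (v, p, q)
    weights: Dict[int, Sequence[float]],  # weighting vector per speed limit
) -> int:
    """Step 2: return the action with the highest evaluation."""
    return max(weights, key=lambda a: q_value(weights[a], state))
```

No random selection takes place in this phase; exploration is confined to learning.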

Although modifications and changes may be suggested by those skilled in the 
art to which this invention pertains, it is the intention of the inventors to embody 
within the patent warranted hereon all changes and modifications that may 
reasonably and properly come under the scope of their contribution to the art. - - 
[- - METHOD AND ARRANGEMENT FOR DETERMINING 
A SEQUENCE OF ACTIONS FOR A SYSTEM 



BACKGROUND OF THE INVENTION 



Field of the Invention: 



This invention generally pertains to systems having states, and in particular to 



Discussion of the Related Art: 

A generalized method and arrangement for determining a sequence of actions for a 
system having states, wherein] a transition in state between two states {being} [is] performed on 
the basis of an action[, is discussed by Neuneier in "Enhancing Q-Learning for Optimal Asset 
Allocation", appearing in the Proceedings of the Neural Information Processing Systems, NIPS 
1997. Neuneier describes a financial market as an example of] {The invention relates to a method 
and an arrangement for determining a sequence of actions for} a system which has states[. His]{, a 
transition in state between two states being performed on the basis of an action. 

Such a method and such an arrangement are known from [1]. 

A financial market is described in [1] as an example for such a system which has states. 
The} system is described as a Markov [Decision Problem (MDP).] {decision problem (MDP). 
The structure of a system which can be described as a Markov decision problem is illustrated in 

The system 201 is in a state xt at an instant t. The state xt can be observed by an observer of 
the system. On the basis of an action at from a set in the state xt of possible actions, at ∈ A(xt), the 
system makes a transition with a certain probability into a subsequent state xt+1 at a subsequent 
instant t+1. 



methods] for determining a sequence of actions for 




h-}[such systems. 



This is illustrated diagrammatically in Figure 2 by a loop. An observer 200 perceives 202 
observable variables concerning the state xt and takes a decision via an action 203 with which it acts 
on the system 201. The system 201 is usually subject to the interference 205. 

Furthermore, the observer 200 obtains a gain rt 204 

which is a function of the action at 203 and the original state xt at the instant t as well as of 
the subsequent state xt+1 of the system at the subsequent instant t+1. 

The gain rt can assume a positive or negative scalar value, depending on whether the 
decision leads, with regard to a prescribable criterion, to a positive or negative system development, 
in [1] to an increase in capital stock or to a loss. 

In a further time step, the observer 200 of the system 201 decides on the basis of the 
observable variables 202, 204 of the subsequent state xt+1 in favor of a new action at+1, etc. 

A sequence of 

State: xt ∈ X 
Action: at ∈ A(xt) 
Subsequent state: xt+1 ∈ X 
Gain: rt = r(xt, at, xt+1) ∈ ℝ 

etc. describes a trajectory of the system which is evaluated by a performance criterion which 
accumulates the individual gains rt over the instants t. It is assumed by way of simplification in a 
Markov decision problem that the state xt and the action at contain all information for the purpose of 
describing a transition probability p(xt+1 | ·) of the system from the state xt to the subsequent state 
xt+1. 

In formal terms, this means that: 

p(xt+1 | xt, at) denotes a transition probability for the subsequent state xt+1 for a given state xt 
and given action at. 

In a Markov decision problem, future states of the system 201 are thus not a function of 
states and actions which lie further in the past than one time step.} 



The characteristics of a Markov {d e c i sion probl e m} [Decision Problem] are represented 
below by way of summary: 

X set of possible states of the system, 

e.g. X = m Bm , 

A(x t ) set of possible actions in the state 

{p(xt*W[p(xwlxJ,a,) *t 

r(x t , a t , x t+1 ) gain with expectation R(x t , a t ). 

Starting from observable variables, the variables denoted below as training data, the aim is to 
determine a strategy, that is to say a sequence of functions 

    $\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}$,    (3) 

which at each instant t map each state into an action rule, that is to say action 

    $\mu_t(x_t) = a_t$.    (4) 

Such a strategy is evaluated by an optimization function. 

The optimization function specifies the expectation of the gains accumulated [ 
]over time at a given strategy π and a start state $x_0$. 

The so-called Q-learning method is described {in [1]} [by Neuneier] as an example of a 
method of approximative dynamic programming. 

An optimum evaluation function V*(x) is defined by 

    $V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X$    (5) 
with 

    $V^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^t \, r(x_t, \mu_t, x_{t+1}) \,\middle|\, x_0 = x \right]$,    (6) 

γ denoting a prescribable reduction factor which is formed in accordance with the following rule: 

    $\gamma = \frac{1}{1 + z}$,    (7) 

    (8) 



A Q-evaluation function $Q^*(x_t, a_t)$ is formed within the Q-learning method for each pair (state $x_t$, 
action $a_t$) in accordance with the following rule: 

    $Q^*(x_t, a_t) = \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \cdot r_t + \gamma \cdot \sum_{x \in X} p(x \mid x_t, a_t) \cdot \max_{a \in A} Q^*(x, a)$.    (9) 

On the basis respectively of the tuple $(x_t, x_{t+1}, a_t, r_t)$, the Q-values Q*(x, a) are [ 
]adapted in the (k+1)-th iteration with a prescribed learning rate $\eta_k$ in accordance with the 
following learning rule: 

    $Q_{k+1}(x_t, a_t) = (1 - \eta_k) \, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right)$.    (10) 
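
Learning rule (10) can be sketched for a tabular Q-function as follows. This is a minimal illustration; the dictionary-based table and the name `q_learning_update` are assumptions made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_learning_update(
    q: QTable,
    x: Hashable,        # state x_t
    a: Hashable,        # action a_t
    r: float,           # gain r_t
    x_next: Hashable,   # subsequent state x_t+1
    actions: Sequence[Hashable],
    eta: float,         # learning rate eta_k
    gamma: float,       # reduction factor
) -> float:
    """Rule (10): blend the old Q-value with the one-step target."""
    best_next = max(q[(x_next, b)] for b in actions)
    q[(x, a)] = (1.0 - eta) * q[(x, a)] + eta * (r + gamma * best_next)
    return q[(x, a)]


q_table: QTable = defaultdict(float)  # unseen pairs start at 0.0
```

Entries that have never been visited default to zero, which is a choice made for this sketch rather than something the rule prescribes.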



Usually, the so-called Q-values Q*(x,a) are approximated for various actions {a} by a function 
approximator in each case, for example a neural network or {else} a polynomial classifier, with a 
weighting vector w a , which contains weights of the function approximator. 

A function approximator is {to be understood as} , for example, a neural network, a polynomial 
classifier or {else} a combination of a neural network with a polynomial classifier. 
It therefore holds that: 

    $Q^*(x, a) \approx Q(x; w^a)$.    (11) 

Changes in the weights in the weighting vector $w^a$ are based on a temporal difference $d_t$, which 
is formed in accordance with the following rule: 

    $d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t})$.    (12) 

The following adaptation rule for the weights of the neural network, which are included in the 
weighting vector $w^a$, follows for the Q-learning method with the use of a neural network: 

    $w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t})$.    (13) 

The neural network representing the system of a financial market, as described {in [1],} [by 
Neuneier] is trained using the training data which describe information on changes in prices on a 
financial market as time series values. 

A further method of approximative dynamic programming {, the so-called TD(λ)-learning method, 
is known from [2] and is explained in more detail in conjunction with an exemplary 
embodiment.} [is the so-called TD(λ) 
learning method. This method is discussed in R.S. Sutton's "Learning To Predict By 
The Method Of Temporal Differences", appearing in Machine Learning, Chapter 3, 
pages 9-44, 1988.] 

Furthermore, it is known from {[3] which} [M. Heger's "Risk and Reinforcement Learning: 
Concepts and Dynamic Programming", ZKW Bericht No. 8/94, Zentrum für 
Kognitionswissenschaften [Center for Cognitive Sciences], Bremen University, 
December 1994, that] risk is associated with a strategy π and an initial [ 
]state $x_t$. A method for risk {avoidance is likewise known from [3]} [avoidance is also discussed by 
Heger, cited above]. 

The following optimization function, which is also referred to as an expanded Q-function 
$Q^{\pi}(x_t, a_t)$, is used in the [Heger] method {known from [3]}: 



Q w (x t , a t }. = r(x t , a t , x t + 1 ) + inf 
xn,x lf K 
p(x 0 , xx,k)>C 



k = l 

(14) 



The expanded Q-function $Q^{\pi}(x_t, a_t)$ describes the worst case if the action $a_t$ is executed 
in the state $x_t$ and the strategy π is followed thereupon. 



The optimization function $Q^{\pi}(x_t, a_t)$ for 

    $Q^*(x_t, a_t) := \max_{\pi \in \Pi} Q^{\pi}(x_t, a_t)$    (15) 

is given by the following rule: 

    $Q^*(x_t, a_t) = \min_{x \in X,\; p(x \mid x_t, a_t) > 0} \left( r(x_t, a_t, x) + \gamma \cdot \max_{a \in A} Q^*(x, a) \right)$.    (16) 



A substantial disadvantage of this mode of procedure is {to be seen in} that only the worst case 
is taken into account when finding the strategy. However, this [inadequately] reflects the 
requirements of the most varied technical systems {only to an inadequate extent.} [.] 
{Furthermore, it is known from [4] to formulate} [In "Dynamic Programming and Optimal 
Control", Athena Scientific, Belmont, 
MA, 1995, D.P. Bertsekas formulates] access control for a communications network [ 
]and {the} routing within the communications network as a problem of dynamic [ 
]programming. 

{The} [Therefore, the present] invention is {therefore} based on the problem of specifying a 
method and {an arrangement} [system] for determining a sequence of actions {for a system,} in 
which [the] method or {action} [sequences of actions achieve] an increased flexibility in 
determining the strategy {is achieved.} [needed.] 

{The problem is solved by the method and by the arrangement in accordance with the features 
of the independent patent claims. 

In a method for computer-aided determination of a sequence of actions for a system which has 
states, a transition in state between two states being performed on the basis of an action, the 
determination of the sequence of actions is performed in such a way that a sequence of states 
resulting from the sequence of actions is optimized with regard to a prescribed optimization function, 
the optimization function including a variable parameter with the aid of which it is possible to set a 
risk which the resulting sequence of states has with respect to a prescribed state of the system. 

An arrangement for determining} [In a method for computer-aided determination of] a sequence of 
actions for a system which has states, a transition in state between two states being performed on the 
basis of an action, {has a proc e ssor wh i ch is s e t up in such a way that} the determination of the 
sequence of actions {can b e } [is] performed in such a way that a sequence of states resulting from 
the sequence of actions is optimized with regard to a prescribed optimization function, the 
optimization function including a variable parameter with the aid of which it is possible to set a risk 
which the resulting sequence of states has with respect to a prescribed state of the system. 
[A system for determining a sequence of actions for a system which has states, a 
transition in state between two states being performed on the basis of an action, has a 
processor which is set up in such a way that the determination of the 
sequence of actions can be performed in such a way that a sequence of states 
resulting from the sequence of actions is optimized with regard to a prescribed optimization 
function, the optimization function including a variable parameter with the aid of which it is 
possible to set a risk which the resulting sequence of states has 
with respect to a prescribed state of the system. 

Thus, the present invention offers] {It becomes possible for the first time owing to the 
invention to specify} a method for determining a sequence of actions at a freely prescribable level of 
accuracy when finding a strategy for a possible closed-loop control or open-loop control of the 
system, in general for influencing it. [Hence, the embodiments] {Preferred developments of the 
invention follow from the dependent claims. 

The developments} described below are valid both for the method and for the {arrangement, the 
processor being respectively set up in the development of arrangement in such a way that the 
development can be implemented.} [system.] 

{In a preferred refinement, a method of approximative} [Approximative] dynamic 
programming is used for the purpose of {determination} [determination], for example a method 
based on Q-learning or {else} a method based on TD(λ)-learning. 

Within Q-learning, the optimization function OFQ is preferably formed in accordance with the 
following rule: 

    $OFQ = Q(x; w^a)$, 

• x denoting a state in a state space X, 

• a denoting an action from an action space A, and 

• $w^a$ denoting the weights of a function approximator which belong to the action a. 

The following adaptation step is executed during Q-learning in order to determine the 
optimum weights $w^a$ of the function approximator: 

    $w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$ 

with the abbreviation 

    $d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t})$, 

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space X, 

• $a_t$ denoting an action from an action space A, 

• γ denoting a prescribable reduction factor, 

• $w_t^{a_t}$ denoting the weighting vector associated with the action $a_t$ before the adaptation 
step, 

• $w_{t+1}^{a_t}$ denoting the weighting vector associated with the action $a_t$ after the adaptation 
step, 

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence, 

• κ ∈ [-1; 1] denoting a risk monitoring parameter, 

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$, 

• $\nabla Q(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its weights, and 

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent 
state $x_{t+1}$. 

The optimization function is preferably formed in accordance with the following [ 
]rule within the TD(λ)-learning method: 

    $OFTD = J(x; w)$, 

• x denoting a state in a state space X, 

• a denoting an action from an action space A, and 

• w denoting the weights of a function approximator. 

The following adaptation step is executed during TD(λ)-learning in order to determine the 
optimum weights w of the function approximator: 

    $w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$, 

with the abbreviations 

    $d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t)$, 

    $z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t)$, 

    $z_{-1} = 0$, 

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space X, 

• $a_t$ denoting an action from an action space A, 

• γ denoting a prescribable reduction factor, 

• $w_t$ denoting the weighting vector before the adaptation step, 

• $w_{t+1}$ denoting the weighting vector after the adaptation step, 

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence, 

• κ ∈ [-1; 1] denoting a risk monitoring parameter, 

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$, 

• $\nabla J(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its [ 
]weights, and 

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent 
state $x_{t+1}$. 



[SUMMARY OF THE INVENTION 

It is an object of the present invention to provide a technical system and method for 
determining a] {The system is preferably a technical system of which before the determination 
measured values are measured which are used in determining the} sequence of actions [using 
measured values. 

It is another object of the present invention to provide a technical system and method 
that] { 

The technical system} can be subjected to open-loop control or {else} closed-loop control with the 
use of {the} [a] determined sequence of actions. 

{The system is preferably} [It is a further object of the invention to provide a technical 
system and method] modeled as a Markov {decision problem.} [Decision Problem.] 

{The method or the arrangement is preferably} [It is an additional object of the invention to 
provide a technical system and method that can be] used in a traffic management system {or} [. 

It is yet another object of the invention to provide a technical system and method that 
can be used] in a communications system, {the} [such that a] sequence of actions {being used in a 
communications network} [is used] to carry out access control {or a routing, that is to say a path 
allocation.} [, routing or path allocation.] 

{Furthermore, the system can be a financial market which is modeled by a Markov decision 
problem, the change in the financial market, for example the} [It is yet a further object of the 
invention to provide a technical system and 

method for a financial market modeled by a Markov Decision Problem, wherein a] change in an 
{ 

}index of stocks[,] or {else} [a change in] a rate of exchange on a foreign exchange market {being 
analyzed by using the method and/or the arrangement and it being} [, makes it] possible to intervene 
in the market in accordance with {the} [a] sequence of determined actions. 



[These and other objects of the invention will be apparent from a careful review 
of the following detailed description of the preferred embodiments, which is to be read in 
conjunction with a review of the accompanying drawing figures. 

BRIEF DESCRIPTION OF THE DRAWINGS] {Exemplary embodiments of the invention are 
illustrated in the figures and explained in more detail below.} 

Figure 1 shows a flowchart {in which individual} [of] method steps {of the first exemplary 
embodiment are illustrated} [according to the present invention]; 
Figure 2 shows a {sketch of a} system {which can be} modeled as a Markov {decision 
problem} [Decision Problem]; 
Figure 3 shows a {sketch of a} communications network {in which} [wherein] access control is 
carried out in a switching unit [according to the present invention]; 
Figure 4 shows a {symbolic sketch of a} function approximator {with the aid of which a method 
of} [for] approximative dynamic programming {is implemented} [according to the 
present invention]; 
Figure 5 shows {a further sketch of} a plurality of function approximators {with the aid of which} 
[for] approximative dynamic programming {is implemented} [according to the 
present invention]; and 
Figure 6 shows a {sketch of a} traffic management system {which is} subjected to closed-loop 
control in accordance with {an exemplary embodiment.} [the present invention.] 

{First exemplary embodiment: access control and routing} [DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS] 

[Figure 1 shows a flowchart according to the present invention, in which 
individual method steps of a first embodiment are provided, which will be discussed 
later. 



Figure 2 shows the structure of a typical Markov Decision Problem method. 



The system 201 is in a state $x_t$ at an instant t. The state $x_t$ can be observed by 
an observer of the system. On the basis of an action $a_t$ from a set in the state $x_t$ of 
possible actions, $a_t \in A(x_t)$, the system makes a transition with a certain probability 
into a subsequent state $x_{t+1}$ at a subsequent instant t+1. 

As illustrated diagrammatically in Figure 2 by a loop, an observer 200 perceives 202 
observable variables concerning the state $x_t$ and takes a decision via an action 203 with which 
it acts on the system 201. The system 201 is usually subject to the interference 205. 

The observer 200 obtains a gain $r_t$ 204 

    $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$,    (1) 

which is a function of the action $a_t$ 203 and the original state $x_t$ at the instant t as well as of the 
subsequent state $x_{t+1}$ of the system at the subsequent instant t+1. 

The gain $r_t$ can assume a positive or negative scalar value depending on whether the 
decision leads, with regard to a prescribable criterion, to a positive or negative system 
development, for example to an increase in capital stock or to a loss. 

In a further time step, the observer 200 of the system 201 decides on the basis of the 
observable variables 202, 204 of the subsequent state $x_{t+1}$ in favor of a new action $a_{t+1}$, etc. 

A sequence of 

State: $x_t \in X$ 
Action: $a_t \in A(x_t)$ 
Subsequent state: $x_{t+1} \in X$ 
Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$ 

describes a trajectory of the system which is evaluated by a performance criterion which 
accumulates the individual gains $r_t$ over the instants t. It is assumed by way of simplification in 
a Markov Decision Problem that the state $x_t$ and the action $a_t$ contain all information for the 
purpose of describing a transition probability $p(x_{t+1} \mid \cdot)$ of the system from the state $x_t$ to the 
subsequent state $x_{t+1}$. 

In formal terms, this means that: 

    $p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t)$.    (2) 

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability for the subsequent state $x_{t+1}$ for a given state $x_t$ and 
given action $a_t$. 

In a Markov Decision Problem, future states of the system 201 are thus not a function 
of states and actions which lie further in the past than one time step. 

Figure 3 shows an embodiment of the present invention involving an access control and 
routing system, such as] a communications network 300 {, which} [. 

The communications network 300] has a multiplicity of switching units 301a, 301b, 301i, 
... 301n, which are interconnected via connections 302a, 302b, 302j, ... 302m. [A] {Furthermore, a} 
first terminal 303 is connected to a first switching unit 301a. From the first terminal 303, the first 
switching unit 301a is sent a request message 304 which requests reservation of a prescribed 
bandwidth within the communications network 300 for the purpose of transmitting data{(}[, such as] 
video data{,} [or] text data{)}. 

It is determined in the first switching unit 301a in accordance with a strategy described below[,] 
whether the requested bandwidth is available in the {communications} [communications] network 
300 on a specified, requested connection {(step 305)} [in step 305]. { 
}The request is refused {(step 306)} [in step 306] if this is not the case. { 
}If sufficient bandwidth is available, it is checked in {a further} checking step {(step 307)} [307] 
whether the bandwidth can be reserved. 

The request is refused {(step 308)} [in step 308] if this is not the case. { 
}Otherwise, the first switching unit 301a selects a route from the first switching unit 301a via further 
switching units 301i to a second terminal 309 with which the first terminal 303 wishes to 
communicate, and a connection is initialized {(step 310)} [in step 310]. 

The starting point below is a communications network 300 which comprises a set of switching
units

$\mathcal{N} = \{1, \ldots, n, \ldots, N\}$ (17)

and a set of physical connections

$\mathcal{L} = \{1, \ldots, l, \ldots, L\}$, (18)

a physical connection l having a capacity of B(l) bandwidth units.

A set

$\mathcal{M} = \{1, \ldots, m, \ldots, M\}$ (19)

of different types of service m is available, a type of service m being characterized by
• a bandwidth requirement b(m),
• an average connection time $1/\nu(m)$, and
• a gain c(m) which is obtained whenever a call request of the corresponding type of service
m is accepted.

The gain c(m) is given by the amount of money which a network operator of the 
communications network 300 bills a subscriber for a connection of the type of service. Clearly, the 
gain c(m) reflects different priorities, which can be prescribed by the network operator and which he 
associates with different services. 



A physical connection l can simultaneously provide any desired combination of
communications connections as long as the bandwidth used for the communications connections
does not exceed the bandwidth available overall for the physical connection.

If a new communications connection of type m is requested between a first node i and a second
node j (terminals are also denoted as nodes), the requested communications connection can, as
represented above, either be accepted or be refused.

If the communications connection is accepted, a route is selected from a set of prescribed routes.
This selection is denoted as a routing. b(m) bandwidth units are used in the communications
connection of type m for each physical connection along
the selected route for the duration of the connection.
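
The bandwidth bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries `used` and `capacity` (keyed by physical connection l) and the function names are hypothetical and do not appear in the application.

```python
def route_is_feasible(route, used, capacity, b_m):
    """True if every physical connection l on the route can carry b_m more units."""
    return all(used[l] + b_m <= capacity[l] for l in route)


def reserve(route, used, capacity, b_m):
    """Reserve b_m bandwidth units on each physical connection of the route.

    Returns False and leaves `used` unchanged if any connection lacks capacity.
    """
    if not route_is_feasible(route, used, capacity, b_m):
        return False
    for l in route:
        used[l] += b_m
    return True


if __name__ == "__main__":
    capacity = {1: 10, 2: 5}   # B(l) for two hypothetical physical connections
    used = {1: 0, 2: 4}
    print(reserve([1, 2], used, capacity, b_m=2))  # connection 2 is too full
    print(reserve([1], used, capacity, b_m=2))
```

The check is made for the whole route before anything is reserved, so a refused request never leaves partially reserved bandwidth behind.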

Thus, during access control, also referred to as call admission control, a route can be
selected within the communications network 300 only when the selected route has sufficient
bandwidth available.

The aim of the access control and of the routing is to maximize a long term gain which is obtained by
acceptance of the requested connections.

At an instant t, the technical system which is the communications network 300 is in a state $x_t$
which is described by a list of routes carrying existing connections, the list showing how
many connections of each type of service are using the respective routes at the instant t.

Events $\omega$, by means of which a state $x_t$ could be transferred into a subsequent state $x_{t+1}$, are the
arrival of new connection request messages, or else the termination of a connection
existing in the communications network 300.

In this embodiment, an action $a_t$ at an instant t, owing to a connection request, is
the decision as to whether a connection request is to be accepted or refused and, if the connection is
accepted, the selection of the route through the communications network 300.

The aim is to determine a sequence of actions, that is to say to learn
a strategy with actions relating to a state $x_t$, in such a way that the following expression is maximized:

$E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k}) \right\}$, (20)

• E{.} denoting an expectation,

• $t_k$ denoting an instant at which a kth event takes place,

• $g(x_{t_k}, \omega_k, a_{t_k})$ denoting the gain which is associated with the kth event, and

• $\beta$ denoting a reduction factor which evaluates an immediate gain as being more valuable
than a gain at instants lying further in the future.

Different implementations of a strategy normally lead to different overall gains G:

$G = \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k})$. (21)

The aim is to maximize the expectation of the overall gain G in accordance with the following
rule J:

$J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k}) \right\}$, (22)

it being possible to set the risk that the overall gain G of a specific implementation of access
control and of a routing strategy falls below the expectation.
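
As a worked illustration of rules (21) and (22), the following Python sketch computes the overall gain G of one realized event sequence and a simple sample average as an estimate of J. The function names and the sample-average estimator are assumptions made here for illustration only.

```python
import math


def overall_gain(event_times, gains, beta):
    """G = sum_k exp(-beta * t_k) * g_k for one realized event sequence."""
    if len(event_times) != len(gains):
        raise ValueError("event_times and gains must have the same length")
    return sum(math.exp(-beta * t) * g for t, g in zip(event_times, gains))


def expected_gain(realizations, beta):
    """Sample-average estimate of J = E{G} from several (times, gains) pairs."""
    totals = [overall_gain(times, gains, beta) for times, gains in realizations]
    return sum(totals) / len(totals)


if __name__ == "__main__":
    # Two hypothetical realizations of the same strategy.
    runs = [([0.0, 1.5, 4.0], [2.0, 1.0, 3.0]), ([0.5, 2.0], [1.0, 1.0])]
    print(expected_gain(runs, beta=0.1))
```

Two strategies with the same estimate of J can still differ widely in how far individual values of G fall below it, which is the spread that the risk parameter introduced below is meant to control.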




The TD($\lambda$)-learning method is used to carry out the access control and the
routing.

The following target function is used in this embodiment:

$J^*(x_t) = E_{\tau}\left\{ e^{-\beta\tau} \right\} E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^*(x_{t+1}) \right] \right\}$, (23)

• A denoting an action space with a prescribed number of actions which are respectively
available in a state $x_t$,

• $\tau$ denoting a first instant at which a first event $\omega$ occurs, and

• $x_{t+1}$ denoting a subsequent state of the system.

An approximated value of the target value J*(x t ) is learned and stored by employing a function 
approximator 400 (compare Figure 4) with the use of training data. 

Training data are data previously measured in the communications network 300 and relating to 
the behavior of the communications network 300 in the case of incoming connection requests 304 
and of termination of messages. This time sequence of states is stored, and these training data are 
used to train the function approximator 400 in accordance with the learning method described below. 

The number of connections of one type of service m on a route of the
communications network 300 serves in each case as the input variable
for each input 401, 402, 403 of the function approximator 400.
These are represented in Figure 4 by blocks 404, 405, 406.
An approximated target value $\tilde{J}$ of the target value $J^*$ is the output variable of the function
approximator 400.

Figure 5 shows a detailed representation of a function approximator 500, which
has several component function approximators 510, 520.

One output variable is the approximated target value $\tilde{J}$, which is formed in accordance with
the following rule:
(24)



The input variables of the component function approximators 510, 520, which are present at the
inputs 511, 512, 513 of the first component function approximator 510, or at the inputs 521, 522 and
523 of the second component function approximator 520, are in turn in each case a number of
connections of a type of service m on a physical connection r, symbolized by blocks 514, 515, 516 for
the first component function approximator 510, and 524, 525 and 526 for the second component function
approximator 520.

Component output variables 530, 531, 532, 533 are fed to an adder unit 540, and the
approximated target variable $\tilde{J}$ is formed as output variable of the adder unit.

Let it be assumed that the communications network 300 is in the state $x_{t_k}$ and that a
request message, with which a type of service m is requested for a connection
between two nodes i, j, reaches the first switching unit 301a.

A list of permitted routes between the nodes i and j is denoted by R(i, j), and a list of all possible
routes is denoted by

$R(i, j, x_{t_k}) \subset R(i, j)$ (25)

as a subset of the routes R(i, j) which could implement a possible connection with regard to the
available and requested bandwidth.

For each possible route r, $r \in R(i, j, x_{t_k})$, a subsequent state $x_{t_{k+1}}(x_{t_k}, \omega_k, r)$ is
determined which results from the fact that the connection request 304 is accepted and the
connection on the route r is made available to the requesting first terminal 303.

This is illustrated in Figure 1 as step 102, the state of the system and the
respective event being determined in step 101.
A route r* to be selected is determined in step 103 in accordance with the
following rule:

$r^* = \arg\max_{r \in R(i, j, x_{t_k})} \tilde{J}(x_{t_{k+1}}(x_{t_k}, \omega_k, r), \Theta_t)$. (26)

A check is made in step 104 as to whether the following rule is fulfilled:

$c(m) + \tilde{J}(x_{t_{k+1}}(x_{t_k}, \omega_k, r^*), \Theta_t) < \tilde{J}(x_{t_k}, \Theta_t)$. (27)

If this is the case, the connection request 304 is rejected in step 105; otherwise
the connection is accepted and switched through to the node j along the selected route r*
in step 106.
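
Steps 103 to 106 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the callables `next_state` and `J` stand in for the subsequent-state computation and for the trained function approximator with fixed weights, and both names, like `decide` itself, are hypothetical.

```python
def decide(feasible_routes, x, c_m, next_state, J):
    """Return (accept, route) following rules (26) and (27).

    feasible_routes: routes with sufficient bandwidth, i.e. R(i, j, x_tk)
    x:               current state x_tk
    c_m:             gain c(m) of the requested type of service
    next_state:      callable (x, r) -> state after accepting on route r
    J:               callable state -> approximated target value
    """
    if not feasible_routes:
        return False, None  # no route can carry the requested bandwidth
    # Rule (26): choose the route whose subsequent state is valued highest.
    best = max(feasible_routes, key=lambda r: J(next_state(x, r)))
    # Rule (27): reject if accepting is worth less than keeping the current state.
    if c_m + J(next_state(x, best)) < J(x):
        return False, None
    return True, best


if __name__ == "__main__":
    # Hypothetical toy model: the state counts connections on two routes and
    # the value falls with load, more steeply on the second route.
    value = lambda s: -(s[0] ** 2 + 2 * s[1] ** 2)
    accept = lambda s, r: tuple(n + (1 if i == r else 0) for i, n in enumerate(s))
    print(decide([0, 1], (1, 1), c_m=5.0, next_state=accept, J=value))
    print(decide([0, 1], (1, 1), c_m=1.0, next_state=accept, J=value))
```

In the toy model the request with the high gain is accepted on the less loaded route, while the request with the low gain is rejected because the lost future value outweighs the immediate gain.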

Weights of the function approximator 400, 500, which are adapted in the TD($\lambda$)-learning
method to the training data, are stored in a parameter vector $\Theta$ for an instant t in each case, such
that an optimized access control and an optimized routing are achieved.

During the training phase, the weighting parameters are adapted to the training data applied to
the function approximator.

A risk parameter $\kappa$ is defined, with the aid of which a desired risk, which the system has with
regard to a prescribed state owing to a sequence of actions and states, can be set in accordance with
the following rules:

$-1 \le \kappa < 0$: risky learning,

$\kappa = 0$: neutral learning with regard to the risk,

$0 < \kappa < 1$: risk-avoiding learning,

$\kappa = 1$: worst-case learning.

Furthermore, a prescribable parameter $0 \le \lambda \le 1$ and a step size sequence $\gamma_k$ are
prescribed in the learning method.



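The weighting values of the weighting vector $\Theta$ are adapted to the training data on the basis
of each event $\omega_{t_k}$ in accordance with the following adaptation rule:

$\Theta_k = \Theta_{k-1} + \gamma_k \aleph^{\kappa}(d_k) z_t$, (28)

in which case

$d_k = e^{-\beta(t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1})$, (29)

$z_t = \lambda e^{-\beta(t_{k-1} - t_{k-2})} z_{t-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1})$, (30)

and

$\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$. (31)

It is assumed that: $z_{-1} = 0$.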
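
One adaptation step of rules (28) to (31) can be sketched in Python as follows. The sketch assumes a linear function approximator, so that the gradient with respect to the weights is simply the feature vector of the previous state; this choice, the feature vectors and all function names are assumptions made for illustration and are not prescribed by the application.

```python
import math


def risk_transform(d, kappa):
    """Rule (31): (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def td_lambda_step(theta, z, phi_prev, phi_curr, g, dt, dt_prev,
                   beta, lam, kappa, step_size):
    """One update for a linear approximator J(x, theta) = theta . phi(x).

    phi_prev, phi_curr: feature vectors of x_{t_{k-1}} and x_{t_k}
    g:                  immediate gain of the kth event
    dt, dt_prev:        t_k - t_{k-1} and t_{k-1} - t_{k-2}
    Returns the new weights, the new eligibility trace and the difference d_k.
    """
    def value(phi):
        return sum(w * f for w, f in zip(theta, phi))

    # Rule (30): decay the trace, then add the gradient (phi_prev for a linear model).
    z = [lam * math.exp(-beta * dt_prev) * zi + f for zi, f in zip(z, phi_prev)]
    # Rule (29): temporal difference with continuous-time discounting.
    d = math.exp(-beta * dt) * (g + value(phi_curr)) - value(phi_prev)
    # Rule (28): move the weights along the trace, scaled by the transformed difference.
    scale = step_size * risk_transform(d, kappa)
    new_theta = [w + scale * zi for w, zi in zip(theta, z)]
    return new_theta, z, d
```

For positive $\kappa$ the transform shrinks positive differences by the factor $(1 - \kappa)$ and enlarges negative ones by $(1 + \kappa)$, so unfavorable surprises dominate the learned value; at $\kappa = 1$ positive differences are ignored altogether, which corresponds to the worst-case setting listed above.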

The function

$g(x_{t_k}, \omega_k, a_{t_k})$ (32)

denotes the immediate gain in accordance with the following rule:

Thus, as described above, a sequence of actions is determined with regard to a connection
request such that a connection request is either rejected or accepted on the basis of an action. The
determination is performed taking account of an optimization function in which the risk can be set
in a variable fashion by means of a risk control parameter $\kappa \in [-1; 1]$.

Figure 6 shows an embodiment of the present invention in relation to a traffic management system,
with a road 600 on which automobiles 601, 602, 603, 604, 605 and 606 are being driven.

Conductor loops 610, 611 integrated into the road 600 receive electric signals in a known way and
feed the electric signals 615, 616 to a computer 620 via an input/output interface 621. In an
analog-to-digital converter 622 connected to the input/output interface 621, the electric signals are digitized into
a time series and stored in a memory 623, which is connected by a bus 624 to the analog-to-digital
converter 622 and a processor 625. Via the input/output interface 621, a traffic management system
650 is fed control signals 651 from which it is possible to set a prescribed speed stipulation 652 in the
traffic management system 650, or else further particulars of traffic regulations, which are displayed
via the traffic management system 650 to drivers of the vehicles 601, 602, 603, 604, 605 and 606.

The following local state variables are used in this case for the purpose of traffic modeling:
• traffic flow speed v,
• vehicle density $\rho$ ($\rho$ = number of vehicles per kilometer, vehicles/km),
• traffic flow q (q = number of vehicles per hour, vehicles/h; $q = v \cdot \rho$), and
• speed restrictions 652 displayed by the traffic management system 650 at an instant in
each case.



The local state variables are measured as described above by using the conductor loops 610, 611.

These variables $(v(t), \rho(t), q(t))$ therefore represent a state of the technical system of "traffic"
at a specific instant t.

In this embodiment, the system is therefore a traffic system which is controlled by
using the traffic management system 650, and an extended Q-learning method is described as the method of
approximative dynamic programming.

The state $x_t$ is described by a state vector
$x(t) = (v(t), \rho(t), q(t))$. (34)



The action $a_t$ denotes the speed restriction 652, which is displayed at the instant t by the traffic
management system 650.

The gain $r(x_t, a_t, x_{t+1})$ describes the quality of the traffic flow which was measured between the
instants t and t+1 by the conductor loops 610 and 611.

In this embodiment, $r(x_t, a_t, x_{t+1})$ denotes

• the average speed of the vehicles in the time interval [t, t+1],
or
• the number of vehicles which have passed the conductor loops 610 and 611 in the time
interval [t, t+1],
or
• the variance of the vehicle speeds in the time interval [t, t+1],
or
• a weighted sum of the above variables.

A value of the optimization function OFQ is determined for each possible action $a_t$, that is to say
for each speed restriction which can be displayed by the traffic management system 650, an
estimated value of the optimization function OFQ being realized in each case as a neural network.

This results in a set of evaluation variables for the various actions $a_t$ in the
system state $x_t$.

In a control phase, those actions $a_t$ for which the maximum evaluation variable OFQ has been determined in the
current system state $x_t$ are selected from the possible actions $a_t$, that is to say from
the set of the speed restrictions which can be displayed by the traffic management system 650.

In accordance with this embodiment, the adaptation rule, known from the Q-learning
method, for calculating the optimization function OFQ is extended by a risk control function
$\aleph^{\kappa}(\cdot)$, which takes account of the risk.

In turn, the risk control parameter $\kappa$ is prescribed in accordance with the strategy from the
first exemplary embodiment in the interval $-1 \le \kappa \le 1$, and represents the risk which a user
wishes to run in the application with regard to the control strategy to be determined.

The following evaluation function OFQ is used in accordance with this exemplary embodiment:

$OFQ = Q(x; w^{a})$, (35)

• $x = (v; \rho; q)$ denoting a state of the traffic system,

• a denoting a speed restriction from the action space A of all speed restrictions which can
be displayed by the traffic management system 650, and

• $w^{a}$ denoting the weights of the neural network which belong to the speed restriction a.

The following adaptation step is executed in Q-learning in order to determine
the optimum weights $w^{a}$ of the neural network:

$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$ (36)

using the abbreviation

$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t})$, (37)



• $x_t$, $x_{t+1}$ denoting in each case a state of the traffic system in accordance with rule (34),

• $a_t$ denoting an action, that is to say a speed restriction which can be displayed by the traffic
management system 650,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector belonging to the action $a_t$ before the adaptation
step,

• $w_{t+1}^{a_t}$ denoting the weighting vector belonging to the action $a_t$ after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence,

• $\kappa \in [-1; 1]$ denoting a risk control parameter,

• $\aleph^{\kappa}$ denoting a risk control function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the neural network with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition in state from the state $x_t$ to the subsequent
state $x_{t+1}$.
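
The adaptation step of rules (36) and (37) can be sketched in Python as follows. In place of the neural network the sketch uses one linear model per speed restriction, $Q(x; w^{a}) = w^{a} \cdot x$, whose gradient with respect to the weights is the state vector itself; this substitution and all names are assumptions made to keep the example short.

```python
def risk_transform(d, kappa):
    """Risk control function: (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def q_value(w, x):
    """Linear stand-in for the network: Q(x; w^a) = w^a . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def q_learning_step(weights, x, a, r, x_next, gamma, eta, kappa):
    """Adapt the weights of the executed action a; returns the difference d_t.

    weights: dict mapping each action (speed restriction) to its weight list
    """
    best_next = max(q_value(w, x_next) for w in weights.values())
    d = r + gamma * best_next - q_value(weights[a], x)            # rule (37)
    scale = eta * risk_transform(d, kappa)
    # Rule (36); for the linear model the gradient with respect to w^a is x.
    weights[a] = [wi + scale * xi for wi, xi in zip(weights[a], x)]
    return d


def greedy_action(weights, x):
    """Control phase: the action with the largest evaluation in state x."""
    return max(weights, key=lambda act: q_value(weights[act], x))
```

Only the weights of the executed action are changed in each step, while the maximum in rule (37) runs over the models of all actions.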



An action $a_t$ can be selected at random from the possible actions during
learning. It is not necessary in this case to select the action $a_t$ which has led to the largest evaluation
variable.



The adaptation of the weights has to be performed in such a way that not only is
a traffic control achieved which is optimized in terms of the expectation of the optimization function,
but account is also taken of the variance of the control results.



This is particularly advantageous since the state vector x(t) models the actual system of traffic
only inadequately in some aspects, and so unexpected disturbances can thereby occur. Thus, the
dynamics of the traffic, and therefore of its modeling, depend on further factors such as weather,
proportion of trucks on the road, proportion of mobile homes, etc., which are not always integrated in
the measured variables of the state vector x(t). In addition, it is not always ensured that the road
users immediately implement the new speed instructions in accordance with the traffic management
system.

A control phase on the real system in accordance with the traffic management system takes
place in accordance with the following steps:

1. The state $x_t$ is measured at the instant t at various points in the traffic system and
yields a state vector $x(t) := (v(t), \rho(t), q(t))$.

2. A value of the optimization function is determined for all possible actions $a_t$, and that action $a_t$
with the highest evaluation in the optimization function is selected.

Although modifications and changes may be suggested by those skilled in the
art to which this invention pertains, it is the intention of the inventors to embody
within the patent warranted hereon all changes and modifications that may
reasonably and properly come under the scope of their contribution to the art.

Description

Method and arrangement for determining a sequence of
actions for a system which has states, a transition in
state between two states being performed on the basis
of an action

The invention relates to a method and an
arrangement for determining a sequence of actions for a
system which has states, a transition in state between
two states being performed on the basis of an action.

Such a method and such an arrangement are known
from [1].

A financial market is described in [1] as an
example for such a system which has states.

The system is described as a Markov decision
problem (MDP). The structure of a system which can be
described as a Markov decision problem is illustrated
in Figure 2.

The system 201 is in a state $x_t$ at an instant
t. The state $x_t$ can be observed by an observer of the
system. On the basis of an action $a_t$ from a set of
possible actions in the state $x_t$, $a_t \in A(x_t)$, the system
makes a transition with a certain probability into a
subsequent state $x_{t+1}$ at a subsequent instant t+1.

This is illustrated diagrammatically in Figure
2 by a loop. An observer 200 perceives observable
variables 202 concerning the state $x_t$ and takes a decision
via an action 203 with which it acts on the system 201.
The system 201 is usually subject to the interference
205.

GR 98 P 2663

Furthermore, the observer 200 obtains a gain $r_t$ 204

$r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$, (1)

which is a function of the action $a_t$ 203 and the
original state $x_t$ at the instant t as well as of the
subsequent state $x_{t+1}$ of the system at the subsequent
instant t+1.

The gain $r_t$ can assume a positive or negative
scalar value, depending on whether the decision leads,
with regard to a prescribable criterion, to a positive
or negative system development, in [1] to an increase
in capital stock or to a loss.

In a further time step, the observer 200 of the
system 201 decides on the basis of the observable
variables 202, 204 of the subsequent state $x_{t+1}$ in favor
of a new action $a_{t+1}$, etc.

A sequence of

State: $x_t \in X$
Action: $a_t \in A(x_t)$
Subsequent state: $x_{t+1} \in X$
Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$

etc. describes a trajectory of the system which is
evaluated by a performance criterion which accumulates
the individual gains $r_t$ over the instants t. It is
assumed by way of simplification in a Markov decision
problem that the state $x_t$ and the action $a_t$ contain all the
information needed to describe a transition
probability $p(x_{t+1} \mid \cdot)$ of the system from the state $x_t$ to
the subsequent state $x_{t+1}$.

In formal terms, this means that:

$p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t)$. (2)

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability for the
subsequent state $x_{t+1}$ for a given state $x_t$ and given
action $a_t$.

In a Markov decision problem, future states of
the system 201 are thus not a function of states and
actions which lie further in the past than one time
step.

The characteristics of a Markov decision
problem are represented below by way of summary:

X: set of possible states of the system,

e.g. X = 9T, 

$A(x_t)$: set of possible actions in the state $x_t$,

$p(x_{t+1} \mid x_t, a_t)$: transition probability,

$r(x_t, a_t, x_{t+1})$: gain with expectation $R(x_t, a_t)$.



Starting from observable variables,
denoted below as training data, the aim is to
determine a strategy, that is to say a sequence of
functions

$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}$, (3)

which at each instant t map each state into an action
rule, that is to say an action

$\mu_t(x_t) = a_t$. (4)

Such a strategy is evaluated by an optimization
function.

The optimization function specifies the
expectation of the gains accumulated over time for a given
strategy $\pi$ and a start state $x_0$.

The so-called Q-learning method is described in
[1] as an example of a method of approximative dynamic
programming.

An optimum evaluation function $V^*(x)$ is defined
by

$V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X$ (5)



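where

$V^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, x_0 = x \right]$, (6)

$\gamma$ denoting a prescribable reduction factor which is
formed in accordance with the following rule:

$\gamma = \frac{1}{1 + z}$, (7)

$z \in \mathbb{R}^{+}$. (8)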

A Q-evaluation function $Q^*(x_t, a_t)$ is formed within the
Q-learning method for each pair (state $x_t$, action $a_t$) in
accordance with the following rule:

$Q^*(x_t, a_t) = \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \cdot r_t + \gamma \cdot \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \cdot \max_{a \in A} Q^*(x_{t+1}, a)$. (9)

On the basis respectively of the tuple ($x_t$,
$x_{t+1}$, $a_t$, $r_t$), the Q-values $Q^*(x, a)$ are adapted in the
k+1th iteration with a prescribed learning rate $\eta_k$ in
accordance with the following learning rule:

$Q_{k+1}(x_t, a_t) = (1 - \eta_k)\, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right)$. (10)
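
Learning rule (10) can be sketched in Python for a table of Q-values as follows. The table layout, a dictionary keyed by (state, action) pairs that starts at zero, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(Q, x, a, r, x_next, actions, eta, gamma):
    """Rule (10): blend the old Q-value with the new one-step estimate."""
    target = r + gamma * max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] = (1.0 - eta) * Q[(x, a)] + eta * target
    return Q[(x, a)]


if __name__ == "__main__":
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0
    actions = ["hold", "buy"]  # hypothetical action labels
    print(q_update(Q, "s0", "buy", 1.0, "s1", actions, eta=0.5, gamma=0.9))
```

With a learning rate of 1 the old value is overwritten completely, while smaller rates average the estimate over many observed tuples.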

Usually, the so-called Q-values $Q^*(x, a)$ are
approximated for various actions a by a function
approximator in each case, for example a neural network
or else a polynomial classifier, with a weighting
vector $w^{a}$, which contains weights of the function
approximator.

A function approximator is to be understood as,
for example, a neural network, a polynomial classifier
or else a combination of a neural network with a
polynomial classifier.

It therefore holds that:

$Q^*(x, a) \approx Q(x; w^{a})$. (11)

Changes in the weights in the weighting vector
$w^{a}$ are based on a temporal difference $d_t$ which is formed
in accordance with the following rule:

$d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t})$ (12)

The following adaptation rule for the weights
of the neural network, which are included in the
weighting vector $w^{a}$, follows for the Q-learning method
with the use of a neural network:

$w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t})$. (13)



The neural network representing the system of a 
financial market, as described in [1], is trained using 
the training data which describe information on changes 
in prices on a financial market as time series values. 

A further method of approximative dynamic
programming, the so-called TD($\lambda$)-learning method, is
known from [2] and is explained in more detail in
conjunction with an exemplary embodiment.

Furthermore, it is known from [3] which risk is
associated with a strategy $\pi$ and an initial state $x_t$. A
method for risk avoidance is likewise known from [3].

The following optimization function, which is
also referred to as an expanded Q-function $Q^{\pi}(x_t, a_t)$, is
used in the method known from [3]:

$Q^{\pi}(x_t, a_t) = \inf_{(x_0, x_1, \ldots),\ p(x_0, x_1, \ldots) > 0} \left[ r(x_t, a_t, x_{t+1}) + \ldots \right]$ (14)

The expanded Q-function $Q^{\pi}(x_t, a_t)$ describes the
worst case if the action $a_t$ is executed in the state $x_t$
and the strategy $\pi$ is followed thereupon.

The optimization function $Q^*(x_t, a_t)$ for

$Q^*(x_t, a_t) := \max_{\pi \in \Pi} Q^{\pi}(x_t, a_t)$ (15)

is given by the following rule:

$Q^*(x_t, a_t) = \min_{x \in X,\ p(x \mid x_t, a_t) > 0} \left( r(x_t, a_t, x) + \gamma \cdot \max_{a \in A} Q^*(x, a) \right)$. (16)
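
The worst-case rule (16) can be sketched in Python for a tabular model as follows. The transition table `P`, the gain callable `r` and the Q-table are hypothetical inputs chosen for illustration; the sketch only evaluates the right-hand side of rule (16) once and is not a complete learning method.

```python
def worst_case_backup(Q, P, r, x, a, actions, gamma):
    """Right-hand side of rule (16): minimum over successors with p > 0.

    Q:       dict mapping (state, action) to the current Q-value
    P:       dict mapping (state, action) to {successor: probability}
    r:       callable (x, a, x_next) -> gain
    """
    return min(
        r(x, a, x_next) + gamma * max(Q[(x_next, b)] for b in actions)
        for x_next, prob in P[(x, a)].items()
        if prob > 0.0
    )


if __name__ == "__main__":
    P = {("s", "go"): {"good": 0.99, "bad": 0.01}}
    Q = {("good", "go"): 10.0, ("bad", "go"): 0.0}
    print(worst_case_backup(Q, P, lambda x, a, y: 1.0, "s", "go", ["go"], 0.9))
```

In the usage example the unfavorable successor determines the result although it occurs with a probability of only 0.01, which illustrates why a pure worst-case criterion can be overly pessimistic.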

A substantial disadvantage of this mode of
procedure is to be seen in that only the worst case is
taken into account when finding the strategy. However,
this reflects the requirements of the most varied
technical systems only to an inadequate extent.

Furthermore, it is known from [4] to formulate
access control for a communications network and the
routing within the communications network as a problem
of dynamic programming.

The invention is therefore based on the problem
of specifying a method and an arrangement for
determining a sequence of actions for a system, in
which method or arrangement an increased flexibility in
determining the strategy is achieved.

The problem is solved by the method and by the
arrangement in accordance with the features of the
independent patent claims.

In a method for computer-aided determination of
a sequence of actions for a system which has states, a
transition in state between two states being performed
on the basis of an action, the determination of the
sequence of actions is performed in such a way that a
sequence of states resulting from the sequence of
actions is optimized with regard to a prescribed
optimization function, the optimization
function including a variable parameter with the aid of
which it is possible to set a risk which the resulting
sequence of states has with respect to a prescribed
state of the system.

An arrangement for determining a sequence of 
actions for a system which has states, a transition in 
state between two states being performed on the basis 
of an action, has a processor which is set up in such a 
way that the determination of the sequence of actions 
can be performed in such a way that a sequence of 
states resulting from the sequence of actions is 
optimized with regard to a prescribed optimization 
function, the optimization function including a 
variable parameter with the aid of which it is possible 
to set a risk which the resulting sequence of states 
has with respect to a prescribed state of the system. 

It becomes possible for the first time owing to 
the invention to specify a method for determining a 
sequence of actions at a freely prescribable level of 
accuracy when finding a strategy for a possible closed- 
loop control or open-loop control of the system, in 
general for influencing it. 

Preferred developments of the invention follow
from the dependent claims.

The developments described below are valid both
for the method and for the arrangement, the processor
being respectively set up in the development of the
arrangement in such a way that the development can be
implemented.

In a preferred refinement, a method of
approximative dynamic programming is used for the
purpose of determination, for example a method based on
Q-learning or else a method based on TD($\lambda$)-learning.

Within Q-learning, the optimization function
OFQ is preferably formed in accordance with the
following rule:

$OFQ = Q(x; w^{a})$,

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• $w^{a}$ denoting the weights of a function approximator
which belong to the action a.

The following adaptation step is executed
during Q-learning in order to determine the optimum
weights $w^{a}$ of the function approximator:

$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$

with the abbreviation

$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t})$,

• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector associated with
the action $a_t$ before the adaptation step,

• $w_{t+1}^{a_t}$ denoting the weighting vector associated with
the action $a_t$ after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.

The optimization function is preferably formed
in accordance with the following rule within the TD($\lambda$)-learning
method:

$OFTD = J(x; w)$,

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• w denoting the weights of a function approximator.

The following adaptation step is executed
during TD($\lambda$)-learning in order to determine the optimum
weights w of the function approximator:

$w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$

with the abbreviations

$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t)$,

$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t)$,

• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t$ denoting the weighting vector before the
adaptation step,

• $w_{t+1}$ denoting the weighting vector after the
adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla J(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.

20 
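
A single TD(λ) adaptation step of this kind might look as follows. The sketch assumes a linear function approximator J(x; w) = w · x (so that its derivative with respect to the weights is the state vector) and an auxiliary vector z that the caller initializes, here with zeros. The function name `td_lambda_step` and its parameter names are hypothetical.

```python
from typing import List, Tuple


def td_lambda_step(w: List[float], z_prev: List[float], x_t: List[float],
                   reward: float, x_next: List[float], gamma: float,
                   lam: float, eta: float,
                   kappa: float) -> Tuple[List[float], List[float], float]:
    """Return (w_next, z_t, d_t) for one risk-weighted TD(lambda) step."""

    def j(x: List[float]) -> float:
        # Assumed linear approximator: J(x; w) = w . x
        return sum(wi * xi for wi, xi in zip(w, x))

    d_t = reward + gamma * j(x_next) - j(x_t)
    # z_t = lambda * gamma * z_{t-1} + grad J(x_t; w_t); the gradient is x_t here.
    z_t = [lam * gamma * zp + xi for zp, xi in zip(z_prev, x_t)]
    sign = (d_t > 0) - (d_t < 0)
    weighted = (1.0 - kappa * sign) * d_t  # risk monitoring function
    w_next = [wi + eta * weighted * zi for wi, zi in zip(w, z_t)]
    return w_next, z_t, d_t
```

The returned vector z is passed back in as `z_prev` on the next step, which is how earlier states continue to receive credit for later gains.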

The system is preferably a technical system on which
measured values are acquired before the determination
and are used in determining the sequence of actions.

The technical system can be subjected to open-loop
control or else closed-loop control with the use
of the determined sequence of actions.

The system is preferably modeled as a Markov
decision problem.

The method or the arrangement is preferably
used in a traffic management system or in a
communications system, the sequence of actions being
used in a communications network to carry out access
control or a routing, that is to say a path allocation.

Furthermore, the system can be a financial
market which is modeled by a Markov decision problem,
the change in the financial market, for example the
change in an index of stocks or else a rate of exchange
on a foreign exchange market, being analyzed by using
the method and/or the arrangement, and it being possible
to intervene in the market in accordance with the
sequence of determined actions.

Exemplary embodiments of the invention are 
illustrated in the figures and explained in more detail 
below. 



Figure 1 shows a flowchart in which individual method 
steps of the first exemplary embodiment are 
illustrated; 

Figure 2 shows a sketch of a system which can be 
modeled as a Markov decision problem; 

Figure 3 shows a sketch of a communications network in 
which access control is carried out in a 
switching unit; 

Figure 4 shows a symbolic sketch of a function 
approximator with the aid of which a method 
of approximative dynamic programming is 
implemented; 

Figure 5 shows a further sketch of a plurality of 
function approximators with the aid of which 
approximative dynamic programming is 
implemented; and 

Figure 6 shows a sketch of a traffic management system
which is subjected to closed-loop control in
accordance with an exemplary embodiment.

First exemplary embodiment: access control and routing.



Figure 3 shows a communications network 300, which has
a multiplicity of switching units 301a, 301b, ...,
301i, ..., 301n, which are interconnected via
connections 302a, 302b, ..., 302j, ..., 302m.

Furthermore, a first terminal 303 is connected
to a first switching unit 301a. From the first terminal
303, the first switching unit 301a is sent a request
message 304 which requests reservation of a prescribed
bandwidth within the communications network 300 for the
purpose of transmitting data (video data, text data).

It is determined in the first switching unit
301a in accordance with a strategy described below
whether the requested bandwidth is available in the
communications network 300 on a specified, requested
connection (step 305).

The request is refused (step 306) if this is
not the case.

If sufficient bandwidth is available, it is
checked in a further checking step (step 307) whether
the bandwidth can be reserved.

The request is refused (step 308) if this is
not the case.

Otherwise, the first switching unit 301a
selects a route from the first switching unit 301a via
further switching units 301i to a second terminal 309
with which the first terminal 303 wishes to
communicate, and a connection is initialized (step



The starting point below is a communications
network 300 which comprises a set of switching units

\[ N = \{1, \ldots, n, \ldots, N\} \qquad (17) \]

and a set of physical connections

\[ L = \{1, \ldots, l, \ldots, L\}, \qquad (18) \]

a physical connection l having a capacity of B(l)
bandwidth units.

A set

\[ M = \{1, \ldots, m, \ldots, M\} \qquad (19) \]

of different types of service m is available, a type
of service m being characterized by

• a bandwidth requirement b(m),

• an average connection time $1/\nu(m)$, and

• a gain c(m) which is obtained whenever a call
request of the corresponding type of service m is
accepted.

The gain c(m) is given by the amount of money
which a network operator of the communications network
300 bills a subscriber for a connection of the type of
service. Clearly, the gain c(m) reflects different
priorities, which can be prescribed by the network
operator and which he associates with different
services.

A physical connection l can simultaneously
provide any desired combination of communications
connections as long as the bandwidth used for the
communications connections does not exceed the
bandwidth available overall for the physical
connection.

If a new communications connection of type m is
requested between a first node i and a second node j
(terminals are also denoted as nodes), the requested
communications connection can, as represented above,
either be accepted or be refused.

If the communications connection is accepted, a
route is selected from a set of prescribed routes. This
selection is denoted as routing. b(m) bandwidth units
are used in the communications connection of type m for
each physical connection along the selected route for
the duration of the connection.

Thus, during access control (call admission
control), a route can be selected within the
communications network 300 only when the selected route
has sufficient bandwidth available.
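
The bandwidth condition can be illustrated with a short sketch. It assumes that each route is given as a list of physical connection identifiers and that the bandwidth units currently in use and the capacity of each physical connection are held in dictionaries. These data structures and the function names are assumptions made for the example only.

```python
from typing import Dict, List


def route_is_feasible(route: List[str], used: Dict[str, int],
                      capacity: Dict[str, int], b_m: int) -> bool:
    """True if every physical connection on the route can carry b_m more units."""
    return all(used[link] + b_m <= capacity[link] for link in route)


def feasible_routes(routes: List[List[str]], used: Dict[str, int],
                    capacity: Dict[str, int], b_m: int) -> List[List[str]]:
    """Subset of the prescribed routes that have sufficient bandwidth."""
    return [r for r in routes if route_is_feasible(r, used, capacity, b_m)]


capacity = {"l1": 10, "l2": 10, "l3": 4}
used = {"l1": 8, "l2": 3, "l3": 1}
routes = [["l1", "l2"], ["l2", "l3"]]
print(feasible_routes(routes, used, capacity, b_m=3))
```

In this example the first route is excluded because its first physical connection would exceed its capacity, while the second route still fits exactly.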

The aim of the access control and of the 
routing is to maximize a long term gain which is 
obtained by acceptance of the requested connections. 

At an instant t, the technical system which is
the communications network 300 is in a state $x_t$ which
is described by a list of routes via existing
connections, by means of which list it is shown how
many connections of which type of service are using the
respective routes at the instant t.

Events ω, by means of which a state $x_t$ could be
transferred into a subsequent state $x_{t+1}$, are the
arrival of new connection request messages, or else the
termination of a connection existing in the
communications network 300.

In this exemplary embodiment, an action $a_t$ at
an instant t owing to a connection request is the
decision as to whether a connection request is to be
accepted or refused and, if the connection is accepted,
the selection of the route through the communications
network 300.

The aim is to determine a sequence of actions,
that is to say, clearly, to learn a strategy with
actions relating to a state $x_t$ in such a
way that the following rule is maximized:

\[ E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \qquad (20) \]

• E{.} denoting an expectation,

• $t_k$ denoting an instant at which a kth event takes
place,

• $g(x_{t_k}, \omega_k, a_{t_k})$ denoting the gain which is
associated with the kth event, and

• $\beta$ denoting a reduction factor which evaluates an
immediate gain as being more valuable than a gain
at instants lying further in the future.

Different implementations of a strategy
normally lead to different overall gains G:

\[ G = \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}). \qquad (21) \]
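
For a finite recorded sequence of events, the overall gain of one implementation can be computed as in the following sketch. The infinite sum is truncated to the recorded events, which is an assumption of the example, as are the function name `overall_gain` and the representation of events as (instant, gain) pairs. The reduction factor is written `beta` here.

```python
import math
from typing import Iterable, Tuple


def overall_gain(events: Iterable[Tuple[float, float]], beta: float) -> float:
    """Sum of exp(-beta * t_k) * g_k over (t_k, g_k) pairs."""
    return sum(math.exp(-beta * t_k) * g_k for t_k, g_k in events)


# Two accepted requests: one at t = 0 with gain 1, one at t = 1 with gain 2.
events = [(0.0, 1.0), (1.0, 2.0)]
print(overall_gain(events, beta=0.0))          # no discounting
print(overall_gain(events, beta=math.log(2)))  # later gain counts half
```

A larger reduction factor makes gains obtained at later instants contribute less to G, in line with the role described for it above.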



The aim is to maximize the expectation of the
overall gain G in accordance with the following rule J:

\[ J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \qquad (22) \]

it being possible to set the risk that the
overall gain G of a specific implementation of the
access control and routing strategy falls below the
expectation.

The TD(λ)-learning method is used to carry out
the access control and the routing.

The following target function is used in this
exemplary embodiment:

\[ J^*(x_t) = E_{\tau}\left\{ e^{-\beta \tau} \, E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^*(x_{t+1}) \right] \right\} \right\}, \qquad (23) \]

• A denoting an action space with a prescribed
number of actions which are respectively available
in a state $x_t$,

• $\tau$ denoting a first instant at which a first event
ω occurs, and

• $x_{t+1}$ denoting a subsequent state of the system.

An approximated value of the target value
$J^*(x_t)$ is learned and stored by employing a function
approximator 400 (compare Figure 4) with the use of
training data.

Training data are data previously measured in
the communications network 300 and relating to the
behavior of the communications network 300 in the case
of incoming connection requests 304 and of termination
of messages. This time sequence of states is stored,
and these training data are used to train the function
approximator 400 in accordance with the learning method
described below.

The input variable at each input 401, 402, 403
of the function approximator 400 is in each case the
number of connections of one type of service m on a
route of the communications network 300. These are
represented symbolically in Figure 4 by blocks 404,
405, 406.

An approximated target value $\tilde{J}$ of the target
value $J^*$ is the output variable of the function
approximator 400.

Figure 5 shows a detailed representation of the
function approximator 500, which in this case has
several component function approximators 510, 520.
One output variable is the approximated target
value $\tilde{J}$, which is formed in
accordance with the following rule:

\[ \tilde{J}(x_t, \Theta_t) = \sum_{r} \tilde{J}^{(r)}\left(x_t^{(r)}, \Theta_t^{(r)}\right). \qquad (24) \]

The input variables of the component function
approximators 510, 520, which are present at the inputs
511, 512, 513 of the first component function
approximator 510, or at the inputs 521, 522 and 523 of
the second component function approximator 520, are, in
turn, in each case a number of connections of a type of
service m on a physical connection r,
symbolized by blocks 514, 515, 516 for the first
component function approximator 510, and 524, 525 and 526
for the second component function approximator 520.

Component output variables 530, 531, 532, 533
are fed to an adder unit 540, and the approximated
target variable $\tilde{J}$ is formed as output variable of the
adder unit.

Let it be assumed that the communications
network 300 is in the state $x_{t_k}$ and that a request
message with which a type of service m of class m is
requested for a connection
between two nodes i, j reaches the first switching unit
301a.

A list of permitted routes between the nodes i
and j is denoted by R(i, j), and a list of all possible
routes is denoted by

\[ R(i, j, x_{t_k}) \subset R(i, j) \qquad (25) \]

as a subset of the routes R(i, j) which could implement
a possible connection with regard to the available and
requested bandwidth.

For each possible route r, $r \in R(i, j, x_{t_k})$, a
subsequent state $x_{t_k+1}(x_{t_k}, \omega_k, r)$ is determined which
results from the fact that the connection request 304
is accepted and the connection on the route r is made
available to the requesting first terminal 303.

This is illustrated in Figure 1 as second step
(step 102), the state of the system and the respective
event being respectively determined in a first step
(step 101).

A route r* to be selected is determined in a
third step (step 103) in accordance with the following
rule:

\[ r^* = \arg\max_{r \in R(i, j, x_{t_k})} \tilde{J}\left(x_{t_k+1}(x_{t_k}, \omega_k, r), \Theta_t\right). \qquad (26) \]

A check is made in a further step (step 104) as
to whether the following rule is fulfilled:

\[ c(m) + \tilde{J}\left(x_{t_k+1}(x_{t_k}, \omega_k, r^*), \Theta_t\right) < \tilde{J}\left(x_{t_k}, \Theta_t\right). \qquad (27) \]

If this is the case, the connection request 304
is rejected (step 105), otherwise the connection is
accepted and "switched through" to the node j along the
selected route r* (step 106).
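
Steps 103 to 106 can be sketched as follows, assuming that the approximated target values of the current state and of each candidate subsequent state have already been computed by the function approximator. The function name `admit` and the representation of candidates as (route, value) pairs are hypothetical; returning `None` for an empty candidate list is likewise an assumption that stands in for the earlier refusal when no route has sufficient bandwidth.

```python
from typing import List, Optional, Tuple


def admit(request_gain: float, current_value: float,
          candidates: List[Tuple[str, float]]) -> Optional[str]:
    """Return the selected route, or None if the request is rejected.

    candidates holds (route, approximated value of the subsequent state).
    """
    if not candidates:
        return None  # assumed: no possible route, request refused
    # Step 103: route whose subsequent state has the highest value.
    best_route, best_value = max(candidates, key=lambda rv: rv[1])
    # Step 104: reject if the gain plus the new value is below the current value.
    if request_gain + best_value < current_value:
        return None  # step 105
    return best_route  # step 106


print(admit(1.0, 5.0, [("r1", 3.0), ("r2", 4.5)]))
print(admit(0.2, 5.0, [("r1", 3.0), ("r2", 4.5)]))
```

The comparison expresses that a request is worth accepting only if its immediate gain compensates for the loss in expected future gain caused by occupying bandwidth.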

Weights of the function approximator 400, 500
which are adapted in the TD(λ)-learning method to the
training data are stored in a parameter vector Θ for
an instant t in each case, such that an optimized
access control and an optimized routing are achieved.

During the training phase, the weighting 
parameters are adapted to the training data applied to 
the function approximator. 

A risk parameter κ is defined with the aid of
which a desired risk, which the system has with regard
to a prescribed state owing to a sequence of actions
and states, can be set in accordance with the following
rules:

-1 < κ < 0: risky learning,

κ = 0: neutral learning with regard to the risk,

0 < κ < 1: risk-avoiding learning,

κ = 1: worst-case learning.
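
The effect of these settings follows directly from the risk function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$ when it is written out for the two signs of its argument. The following is only a worked restatement of that definition, taking sign(0) = 0:

\[ \aleph^{\kappa}(\xi) = \begin{cases} (1 - \kappa)\,\xi, & \xi > 0, \\ 0, & \xi = 0, \\ (1 + \kappa)\,\xi, & \xi < 0. \end{cases} \]

For κ = 0 positive and negative differences are weighted equally. For 0 < κ < 1 negative differences (outcomes worse than estimated) are weighted more heavily than positive ones, and for κ = 1 positive differences are ignored altogether. For negative κ the weighting is reversed.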



Furthermore, a prescribable parameter 0 < λ < 1
and a step size sequence $\gamma_k$ are prescribed in the
learning method.

The weighting values of the weighting vector Θ
are adapted to the training data on the basis of each
event $\omega_{t_k}$ in accordance with the following adaptation
rule:

\[ \Theta_k = \Theta_{k-1} + \gamma_k \, \aleph^{\kappa}(d_k) \, z_t, \qquad (28) \]

in which case

\[ d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \qquad (29) \]

\[ z_t = \lambda \, e^{-\beta (t_{k-1} - t_{k-2})} \, z_{t-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \qquad (30) \]

and

\[ \aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi. \qquad (31) \]

It is assumed that: z_          (32)

The function

\[ g(x_{t_k}, \omega_k, a_{t_k}) \]

denotes the immediate gain in accordance with the
following rule:

\[ g(x_{t_k}, \omega_k, a_{t_k}) = \begin{cases} c(m) & \text{when } \omega_{t_k} \text{ is a service request for a type of service } m \text{ and the connection is accepted,} \\ 0 & \text{otherwise.} \end{cases} \qquad (33) \]



Thus, as described above, a sequence of actions
is determined with regard to a connection request such
that a connection request is either rejected or
accepted on the basis of an action. The determination
is performed taking account of an optimization function
in which the risk can be set by means of a risk control
parameter $\kappa \in [-1; 1]$ in a variable fashion.

Second exemplary embodiment: Traffic management system

Figure 6 shows a road 600 on which automobiles 
601, 602, 603, 604, 605 and 606 are being driven. 

Conductor loops 610, 611 integrated into the 
road 600 receive electric signals in a known way and 
feed the electric signals 615, 616 to a computer 620 
via an input/output interface 621. In an analog-to- 
digital converter 622 connected to the input/output 
interface 621, the electric signals are digitized into 
a time series and stored in a memory 623, which is 
connected by a bus 624 to the analog-to-digital 
converter 622 and a processor 625. Via the input/output 
interface 621, a traffic management system 650 is fed 
control signals 651 from which it is possible to set a 
prescribed speed stipulation 652 in the traffic 
management system 650, or else further particulars of 
traffic regulations, which are displayed via the 
traffic management system 650 to drivers of the 
vehicles 601, 602, 603, 604, 605 and 606. 

The following local state variables are used in 
this case for the purpose of traffic modeling: 

• traffic flow rate v,

• vehicle density ρ (ρ = number of vehicles per
kilometer, Fz/km),

• traffic flow q (q = number of vehicles per hour,
Fz/h; q = v · ρ), and

• speed restrictions 652 displayed by the traffic
management system 650 at an instant in each case.

The local state variables are measured as
described above by using the conductor loops 610, 611.

These variables (v(t), ρ(t), q(t)) therefore
represent a state of the technical system of "traffic"
at a specific instant t.

In this exemplary embodiment, the system is 
therefore a traffic system which is controlled by using 
the traffic management system 650. 

In this second exemplary embodiment, an 
extended Q-learning method is described as method of 
approximative dynamic programming. 

The state $x_t$ is described by a state vector

\[ x(t) = (v(t), \rho(t), q(t)). \qquad (34) \]

The action $a_t$ denotes the speed restriction
652, which is displayed at the instant t by the traffic
management system 650.

The gain $r(x_t, a_t, x_{t+1})$ describes the quality of
the traffic flow which was measured between the
instants t and t+1 by the conductor loops 610 and 611.
In this second exemplary embodiment, $r(x_t, a_t, x_{t+1})$
denotes

• the average speed of the vehicles in the time
interval [t, t+1]

or

• the number of vehicles which have passed the
conductor loops 610 and 611 in the time interval
[t, t+1]

or

• the variance of the vehicle speeds in the time
interval [t, t+1],

or

• a weighted sum from the above variables.
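
The last alternative, a weighted sum, could be computed as in the following sketch. It assumes that the individual vehicle speeds measured in the time interval are available as a list; the function name `traffic_gain`, the default weights, the use of the population variance and the return value of 0.0 for an interval without vehicles are all assumptions of the example.

```python
from statistics import mean, pvariance
from typing import Sequence


def traffic_gain(speeds: Sequence[float], w_speed: float = 1.0,
                 w_count: float = 0.0, w_var: float = 0.0) -> float:
    """Weighted sum of average speed, vehicle count and speed variance."""
    if not speeds:
        return 0.0  # assumed convention for an interval without vehicles
    return (w_speed * mean(speeds)
            + w_count * len(speeds)
            + w_var * pvariance(speeds))


speeds = [80.0, 100.0, 120.0]
print(traffic_gain(speeds))                     # average speed only
print(traffic_gain(speeds, w_count=2.0))        # speed plus throughput
print(traffic_gain(speeds, 0.0, 0.0, -1.0))     # penalize uneven speeds
```

Choosing a negative weight for the variance, as in the last call, is one way to express that a uniform traffic flow is preferred; the specification leaves the weights open.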

A value of the optimization function OFQ is
determined for each possible action $a_t$, that is to say
for each speed restriction which can be displayed by
the traffic management system 650, an estimated value
of the optimization function OFQ being realized in each
case as a neural network.

This results in a set of evaluation variables
for the various actions $a_t$ in the system state $x_t$.

Those actions $a_t$ for which the maximum
evaluation variable OFQ has been determined in the
current system state $x_t$ are selected in a control phase
from the possible actions $a_t$, that is to say from the
set of the speed restrictions which can be displayed by
the traffic management system 650.
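
The selection in the control phase amounts to taking the action with the largest evaluation variable. In the sketch below each neural network is replaced by an arbitrary callable that maps a state vector to a number; these stand-in functions, the speed restriction values and the name `select_action` are purely illustrative assumptions.

```python
from typing import Callable, Dict, Tuple

State = Tuple[float, float, float]  # (v, rho, q)


def select_action(state: State,
                  evaluators: Dict[int, Callable[[State], float]]) -> int:
    """Return the speed restriction whose evaluator gives the highest value."""
    return max(evaluators, key=lambda a: evaluators[a](state))


# Stand-ins for the trained networks, one per displayable speed restriction.
evaluators = {
    80: lambda x: 0.1 * x[0],
    100: lambda x: 12.0,
    120: lambda x: 5.0,
}
state = (90.0, 20.0, 1800.0)
print(select_action(state, evaluators))
```

Keeping one evaluator per action mirrors the description above, in which a separate estimated value of the optimization function is realized for each speed restriction.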

In accordance with this exemplary embodiment,
the adaptation rule, known from the Q-learning method,
for calculating the optimization function OFQ is
extended by a risk control function $\aleph^{\kappa}(\cdot)$, which takes
account of the risk.

In turn, the risk control parameter κ is
prescribed in accordance with the strategy from the
first exemplary embodiment in the interval
[-1; 1], and represents the risk which a user
wishes to run in the application with regard to the
control strategy to be determined.

The following evaluation function OFQ is used
in accordance with this exemplary embodiment:

\[ OFQ = Q(x; w^a), \qquad (35) \]

• x = (v; ρ; q) denoting a state of the traffic
system,

• a denoting a speed restriction from the action
space A of all speed restrictions which can be
displayed by the traffic management system 650,
and

• $w^a$ denoting the weights of the neural network
which belong to the speed restriction a.

The following adaptation step is executed in
Q-learning in order to determine the optimum weights $w^a$
of the neural network:

\[ w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}) \qquad (36) \]

using the abbreviation

\[ d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}), \qquad (37) \]

• $x_t$, $x_{t+1}$ denoting in each case a state of the
traffic system in accordance with rule (34),

• $a_t$ denoting an action, that is to say a speed
restriction which can be displayed by the traffic
management system 650,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector belonging to
the action $a_t$, before the adaptation step,

• $w_{t+1}^{a_t}$ denoting the weighting vector belonging to
the action $a_t$, after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk control parameter,

• $\aleph^{\kappa}$ denoting a risk control function
$\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the neural
network with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
in state from the state $x_t$ to the subsequent state
$x_{t+1}$.

An action $a_t$ can be selected at random from the
possible actions $a_t$ during learning. It is not
necessary in this case to select the action $a_t$ which
has led to the largest evaluation variable.

The adaptation of the weights has to be 
performed in such a way that not only is a traffic 
control achieved which is optimized in terms of the 
expectation of the optimization function, but that also 
account is taken of a variance of the control results. 

This is particularly advantageous since the 
state vector x(t) models the actual system of traffic 
only inadequately in some aspects, and so unexpected 
disturbances can thereby occur. Thus, the dynamics of 
the traffic, and therefore of its modeling, depend on 
further factors such as weather, proportion of trucks 
on the road, proportion of mobile homes, etc., which 
are not always integrated in the measured variables of 
the state vector x(t) . In addition, it is not always 
ensured that the road users immediately implement the 
new speed instructions in accordance with the traffic 
management system. 

A control phase on the real system in
accordance with the traffic management system takes
place in accordance with the following steps:

1. The state $x_t$ is measured at the instant t at
various points in the traffic system
and yields a state vector x(t) := (v(t), ρ(t),
q(t)).

2. A value of the optimization function is determined
for all possible actions $a_t$, and that action $a_t$
with the highest evaluation in the optimization
function is selected.

Patent claims

1. A method for computer-aided determination of a 
sequence of actions for a system which has states, a 
transition in state between two states being performed 
on the basis of an action, in the case of which the 
determination of the sequence of actions is performed 
in such a way that a sequence of states resulting from 
the sequence of actions is optimized with regard to a 
prescribed optimization function, the optimization 
function including a variable parameter with the aid of 
which it is possible to set a risk which the resulting 
sequence of states has with respect to a prescribed 
state of the system. 

2. The method as claimed in claim 1, in which a 
method of approximative dynamic programming is used for 
the purpose of determination. 

3. The method as claimed in claim 2, in which the 
method of approximative dynamic programming is a method 
based on Q-learning. 

4. The method as claimed in claim 3, in which the
optimization function OFQ is formed within Q-learning
in accordance with the following rule:

\[ OFQ = Q(x; w^a), \]

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• $w^a$ denoting the weights of a function approximator
which belong to the action a,

and in which the weights of the function approximator
are adapted in accordance with the following rule:

\[ w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}) \]

with the abbreviation

\[ d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}), \]

• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector associated with
the action $a_t$ before the adaptation step,

• $w_{t+1}^{a_t}$ denoting the weighting vector associated with
the action $a_t$ after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function
$\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.

5. The method as claimed in claim 2, in which the
method of approximative dynamic programming is a method
based on TD(λ)-learning.

6. The method as claimed in claim 5, in which the
optimization function OFTD is formed within
TD(λ)-learning in accordance with the following rule:

\[ OFTD = J(x; w), \]

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• w denoting the weights of a function approximator,

and in which the weights of the function approximator
are adapted in accordance with the following rule:

\[ w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t \]

with the abbreviations

\[ d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t), \]

\[ z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t), \]

• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t$ denoting the weighting vector before the
adaptation step,

• $w_{t+1}$ denoting the weighting vector after the
adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function
$\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,

• $\nabla J(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.

7. The method as claimed in one of claims 1 to 6,
in which the system is a technical system of which
before the determination measured values are measured
which are used in determining the sequence of actions.

8. The method as claimed in claim 7, in which the
technical system is subjected to open-loop control in
accordance with the sequence of actions.

9. The method as claimed in claim 7, in which the
technical system is subjected to closed-loop control in
accordance with the sequence of actions.

10. The method as claimed in one of claims 1 to 9, 
in which the system is modeled as a Markov decision 
problem. 

11. The method as claimed in one of claims 1 to 10,
being used in a traffic management system.

12. The method as claimed in one of claims 1 to 10,
being used in a communications system.

13. The method as claimed in one of claims 1 to 10,
being used to carry out access control in a
communications network.

14. The method as claimed in one of claims 1 to 10,
being used to carry out a routing in a communications
network.

15. An arrangement for determining a sequence of
actions for a system which has states, a transition in
state between two states being performed on the basis
of an action,
having a processor which is set up in such a way that
the determination of the sequence of actions can be
performed in such a way that a sequence of states
resulting from the sequence of actions is optimized
with regard to a prescribed optimization function, the
optimization function including a variable parameter
with the aid of which it is possible to set a risk
which the resulting sequence of states has with respect
to a prescribed state of the system.

16. The arrangement as claimed in claim 15, being
used to subject a technical system to open-loop
control.

17. The arrangement as claimed in claim 15, being
used to subject a technical system to closed-loop
control.

18. The arrangement as claimed in claim 15, being
used in a traffic management system.

19. The arrangement as claimed in claim 15, being
used in a communications system.

20. The arrangement as claimed in claim 15, being
used to carry out access control in a communications
network.

21. The arrangement as claimed in claim 15, being
used to carry out a routing in a communications
network.

Abstract

Method and arrangement for determining a sequence of
actions for a system which has states, a transition in
state between two states being performed on the basis
of an action

The determination of a sequence of actions is
performed in such a way that a sequence of states
resulting from the sequence of actions is optimized with
regard to a prescribed optimization function. The
optimization function includes a variable parameter
with the aid of which it is possible to set a risk
which the resulting sequence of states has with respect
to a prescribed state of the system.

Declaration for Patent Applications with Power of Attorney

German Language Declaration



As a below named inventor, I hereby declare that:

My residence, post office address and citizenship are
as stated below next to my name,

I believe I am the original, first and sole inventor (if
only one name is listed below) or an original, first and
joint inventor (if plural names are listed below) of the
subject matter which is claimed and for which a patent
is sought on the invention entitled

Verfahren und Anordnung zur Ermittlung einer Folge von
Aktionen für ein System, welches Zustände aufweist,
wobei ein Zustandsübergang zwischen zwei Zuständen
aufgrund einer Aktion erfolgt



the specification of which

(check one)

☒ is attached hereto.

□ was filed on ______ as
PCT international application
PCT Application No. ______
and was amended on ______
(if applicable)



I hereby state that I have reviewed and understand the
contents of the above identified specification,
including the claims as amended by any amendment
referred to above.



I acknowledge the duty to disclose information which
is material to the examination of this application in
accordance with Title 37, Code of Federal
Regulations, §1.56(a).



I hereby claim foreign priority benefits under Title 35,
United States Code, §119 of any foreign application(s)
for patent or inventor's certificate listed below and
have also identified below any foreign application for
patent or inventor's certificate having a filing date
before that of the application on which priority is
claimed:

Prior foreign applications

(Number)       (Country)   (Day Month Year Filed)   Priority Claimed
198 43 620.3   Germany     23 September 1998        ☒ Yes   □ No



Ich beanspruche hiermit gemäss Absatz 35 der 
Zivilprozessordnung der Vereinigten Staaten, Paragraph 
120, den Vorzug aller unten aufgeführten Anmeldungen 
und falls der Gegenstand aus jedem Anspruch 
dieser Anmeldung nicht in einer früheren 
amerikanischen Patentanmeldung laut dem ersten 
Paragraphen des Absatzes 35 der Zivilprozessordnung 
der Vereinigten Staaten, Paragraph 122 offenbart ist, 
erkenne ich gemäss Absatz 37, Bundesgesetzbuch, 
Paragraph 1.56(a) meine Pflicht zur Offenbarung von 
Informationen an, die zwischen dem Anmeldedatum 
der früheren Anmeldung und dem nationalen oder 
PCT internationalen Anmeldedatum dieser Anmeldung 
bekannt geworden sind. 


I hereby claim the benefit under Title 35, United States 
Code, §120 of any United States application(s) listed 
below and, insofar as the subject matter of each of the 
claims of this application is not disclosed in the prior 
United States application in the manner provided by 
the first paragraph of Title 35, United States Code, 
§122, I acknowledge the duty to disclose material 
information as defined in Title 37, Code of Federal 
Regulations, §1.56(a) which occurred between the 
filing date of the prior application and the national or 
PCT international filing date of this application. 


(Application Serial No.) 
(Anmeldeseriennummer) 


(Filing Date) 
(Anmeldedatum) 


(Status) 

(patentiert, anhängig, 
aufgegeben) 


(Status) 

(patented, pending, 
abandoned) 


(Application Serial No.) 
(Anmeldeseriennummer) 


(Filing Date) 
(Anmeldedatum) 


(Status) 

(patentiert, anhängig, 
aufgegeben) 


(Status) 

(patented, pending, 
abandoned) 


Ich erkläre hiermit, dass alle von mir in der vorliegenden 
Erklärung gemachten Angaben nach meinem 
besten Wissen und Gewissen der vollen Wahrheit 
entsprechen, und dass ich diese eidesstattliche Erklärung 
in Kenntnis dessen abgebe, dass wissentlich und 
vorsätzlich falsche Angaben gemäss Paragraph 1001, 
Absatz 18 der Zivilprozessordnung der Vereinigten 
Staaten von Amerika mit Geldstrafe belegt und/oder 
Gefängnis bestraft werden können, und dass derartig 
wissentlich und vorsätzlich falsche Angaben die 
Gültigkeit der vorliegenden Patentanmeldung oder eines 
darauf erteilten Patentes gefährden können. 


I hereby declare that all statements made herein of my 
own knowledge are true and that all statements made 
on information and belief are believed to be true, and 
further that these statements were made with the 
knowledge that willful false statements and the like so 
made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States 
Code and that such willful false statements may 
jeopardize the validity of the application or any patent 
issued thereon. 
tentanwälte) und/oder Patent-Agenten mit der Verfol- 
der Abwicklung aller damit verbundenen Geschäfte 
vor dem Patent- und Warenzeichenamt: (Name und 
Registrationsnummer anführen) 



POWER OF ATTORNEY: As a named inventor, I 
hereby appoint the following attorney(s) and/or 
agent(s) to prosecute this application and transact all 
business in the Patent and Trademark Office connected 
therewith (list name and registration number) 



And I hereby appoint 

Messrs. John D. Simpson (Registration No. 19,842), Lewis T. Steadman (17,074), William C. Stueber (16,453), P. Phillips Connor (19,259), Dennis A. Gross 
(24,410), Marvin Moody (16,549), Steven H. Noll (28,982), Brett A. Valiquet (27,841), Thomas I. Ross (29,275), Kevin W. Guynn (29,927), Edward A. Lehmann 
(22,312), James D. Hobart (24,149), Robert M. Barrett (30,142), James Van Santen (16,584), J. Arthur Gross (13,615), Richard J. Schwarz (13,472), 
Melvin A. Robinson (31,870), David R. Metzger (32,919), and John R. Garrett (27,888), all members of the firm of Hill, Steadman & Simpson, A Professional 
Corporation. 



Telefongespräche bitte richten an: (Name und Telefonnummer) 

Direct Telephone Calls to: (name and telephone number) 

312/876-0200 

Ext. 



Postanschrift: Send Correspondence to: 

HILL, STEADMAN & SIMPSON 
A Professional Corporation 
85th Floor Sears Tower, Chicago, Illinois 60606 



Voller Name des einzigen oder ursprünglichen Erfinders: 


Full name of sole or first inventor: 


Unterschrift des Erfinders Datum 




Inventor's signature Date 


D-81667 München, Germany 




Staatsangehörigkeit 


Citizenship 


Postanschrift 

Gravelottestr. 3 


Post Office Address 


D-81667 München 
Bundesrepublik Deutschland 




Voller Name des zweiten Miterfinders (falls zutreffend): 

MIHATSCH, Oliver 


Full name of second joint inventor, if any: 


Unterschrift des Erfinders Datum 



Second Inventor's signature Date 


Wohnsitz 

D-80634 München, Germany 


Residence 


Staatsangehörigkeit 

Bundesrepublik Deutschland 


Citizenship 


Postanschrift 

Schulstr. 31 


Post Office Address 


D-80634 München 
Bundesrepublik Deutschland 





(Bitte entsprechende Informationen und Unterschriften im Falle von dritten und weiteren Miterfindern angeben). 

(Supply similar information and signature for third and subsequent joint inventors). 